SSLShader: Cheap SSL Acceleration with Commodity Processors


Keon Jang*, Sangjin Han*, Seungyeop Han†, Sue Moon*, and KyoungSoo Park*
Abstract


Secure end-to-end communication is becoming increas- 
ingly important as more private and sensitive data is 
transferred on the Internet. Unfortunately, today’s SSL 
deployment is largely limited to security or privacy- 
critical domains. The low adoption rate is mainly at- 
tributed to the heavy cryptographic computation over- 
head on the server side, and the cost of good privacy on 
the Internet is tightly bound to expensive hardware SSL 
accelerators in practice. 

In this paper we present high-performance SSL accel- 
eration using commodity processors. First, we show that 
modern graphics processing units (GPUs) can be easily 
converted to general-purpose SSL accelerators. By ex- 
ploiting the massive computing parallelism of GPUs, we 
accelerate SSL cryptographic operations beyond what 
state-of-the-art CPUs provide. Second, we build a trans- 
parent SSL proxy, SSLShader, that carefully leverages 
the trade-offs of recent hardware features such as AES- 
NI and NUMA and achieves both high throughput and 
low latency. In our evaluation, the GPU implementation 
of RSA shows a factor of 22.6 to 31.7 improvement over 
the fastest CPU implementation. SSLShader achieves 
29K transactions per second for small files while it trans- 
fers large files at 13 Gbps on a commodity server ma- 
chine. These numbers are comparable to high-end com- 
mercial SSL appliances at a fraction of their price. 


1 Introduction 


Secure Sockets Layer (SSL) and Transport Layer Secu- 
rity (TLS) have served as de-facto standard protocols 
for secure transport layer communication for over 15 
years. With endpoint authentication and content encryp- 
tion, SSL delivers confidential data securely and prevents 
eavesdropping and tampering by random attackers. On- 
line banking, e-commerce, and Web-based email sites 
typically employ SSL to protect sensitive user data such 
as passwords, credit card information, and private content.
Operating atop the transport layer, SSL is used for
various application protocols such as HTTP, SMTP, FTP, 
XMPP, and SIP, just to name a few. 

Despite its great success, today’s SSL deployment is 
largely limited to security-critical domains or enterprise 
applications. A recent survey shows that the total num- 
ber of registered SSL certificates is slightly over one 
million [18], reflecting less than 0.5% of active Internet 
sites [19]. Even in the SSL-enabled sites, SSL is often 
enforced only for a fraction of activities (e.g., password 
submission or billing information). For example, Web-
based email sites such as Windows Live Hotmail and
Yahoo! Mail do not support SSL for the content, mak-
ing the private data vulnerable to sniffing in untrusted
wireless environments. Popular social networking sites
such as Facebook and Twitter allow SSL only when
users make explicit requests with a noticeable latency 
penalty. In fact, few sites listed in Alexa top 500 [2] 
enable SSL by default for the entire content. 

The low SSL adoption is mainly attributed to its heavy 
computation overhead on the server side. The typical 
processing bottleneck lies in the key exchange phase 
involving public key cryptography [22, 29]. For in- 
stance, even the latest CPU core cannot handle more 
than 2K SSL transactions per second (TPS) with 1024- 
bit RSA while the same core can serve over 10K plain- 
text HTTP requests per second. As a workaround, high- 
performance SSL servers often distribute the load to a 
cluster of machines [52] or offload cryptographic opera- 
tions to dedicated hardware proxies [3, 4, 6, 13] or accel- 
erators [9, 10, 14, 15]. Consequently, user privacy in the 
Internet still remains an expensive option even with the 
modern processor innovation. 

Our goal is to find a practical solution with commodity 
processors to bring the benefits of SSL to all private
Internet communication. In this paper, we present our ap-
proach in two steps. First, we exploit commodity graph- 
ics processing units (GPUs) as high-performance crypto- 
graphic function accelerators. With hundreds of stream- 
ing processing cores, modern GPUs execute the code 
in the single-instruction-multiple-data (SIMD) fashion, 
providing ample computation cycles and high memory 
bandwidth to massively parallel applications. Through 
careful algorithm analysis and parallelization, we accel- 
erate RSA, AES and SHA-1 cryptographic primitives 
with GPUs. Compared with previous GPU approaches 
that take hundreds of milliseconds to a few seconds to 
reach the peak RSA performance [37,56], our implemen- 
tation produces the maximum throughput with one or 
two orders of magnitude smaller latency, which is well- 
suited for interactive Web environments. 


Second, we build SSLShader, a GPU-accelerated SSL 
proxy that transparently handles SSL transactions for 
existing network servers. SSLShader selectively of- 
floads cryptographic operations to GPUs to achieve high 
throughput and low latency depending on the load level. 
Moreover, SSLShader leverages the recent hardware fea- 
tures such as multi-core CPUs, the non-uniform memory 
access (NUMA) architecture, and the AES-NI instruc- 
tion set. 


Our contributions are summarized as follows: 


(i) We provide detailed algorithm analysis and paral- 
lelization techniques to scale the performance of RSA, 
AES and SHA-1 in GPUs. To the best of our knowl- 
edge, our GPU implementation of RSA shows the high- 
est throughput reported so far. On a single NVIDIA 
GTX580 card, our implementation shows 92K RSA op- 
erations/s for 1024-bit keys, a factor of 27 better perfor- 
mance over the fastest CPU implementation with a single 
2.66 GHz Intel Xeon core. 


(ii) We introduce opportunistic workload offloading
between CPU and GPU to achieve both low latency and 
high throughput. When lightly loaded, SSLShader uti- 
lizes low-latency cryptographic code execution by CPUs, 
but at high load it batches and offloads multiple crypto- 
graphic operations to GPUs. 


(iii) We build and evaluate a complete SSL proxy sys-
tem that exploits GPUs as SSL accelerators. Unlike prior 
GPU work that focuses on microbenchmarks of crypto- 
graphic operations, we focus on systems interaction in 
handling the SSL protocol. SSLShader achieves 13 Gbps 
SSL large-file throughput handling 29K SSL TPS on a
single machine with two hexa-core Intel Xeon 5650’s. 


The rest of the paper is organized as follows. In Sec- 
tion 2, we provide a brief background on SSL, popular 
cryptographic operations, and the modern GPU. In Sec- 
tions 3 and 4 we explain our optimization techniques for 
RSA, AES and SHA-1 implementations in a GPU. In
Sections 5 and 6, we show the design and evaluation of
SSLShader. In Sections 7 and 8 we discuss related work
and conclude.


Figure 1: SSL handshake and data


2 Background 


In this section, we provide a brief introduction to 
SSL and describe the cryptographic algorithms used in 
TLS_RSA_WITH_AES_128_CBC_SHA, one of the most 
popular SSL cipher suites. We also describe the ba- 
sic architecture of modern GPUs and strategies to ex- 
ploit them for cryptographic operations. In this paper 
we use TLS_RSA_WITH_AES_128_CBC_SHA as a ref- 
erence cipher suite, but we believe our techniques to be 
easily applicable to other similar algorithms. 


2.1 Secure Sockets Layer 


SSL was developed by Netscape in 1994 and has been 
widely used for secure transport layer communication. 
SSL provides three important security properties in pri- 
vate communication: data confidentiality, data integrity, 
and end-point authentication. From SSL version 3.0, the 
official name has changed to TLS and the protocol has 
been standardized by IETF. SSL and TLS share the same 
protocol structure, but they are incompatible, since they 
use different key derivation functions to generate session 
and message authentication code (MAC) keys. 

Figure 1 describes how the SSL protocol works. A
client sends a ClientHello message to the target server 
with a list of supported cipher suites and a nonce. The 
server picks one (asymmetric cipher, symmetric cipher, 
MAC algorithm) tuple in the supported cipher suites, and 
responds with a ServerHello message with the chosen ci-
pher suite, its own certificate and a server-side nonce.
Upon receiving the ServerHello message, the client ver- 
ifies the server’s certificate, generates a pre-master se- 
cret and encrypts it with the server’s public key. The en- 
crypted pre-master secret is delivered to the server, and 
both parties independently generate two symmetric ci- 
pher session keys and two MAC keys using a predefined 
key derivation function with the pre-master secret and the
two nonces as input. Each (session, MAC) key pair is 
used for encryption and MAC generation for one direc-
tion (e.g., client to server or server to client). 

In the Web environment where most objects are small, 
the typical SSL bottleneck lies in decrypting the pre- 
master secret with the server-side private key. The client- 
side latency could increase significantly if the server is 
overloaded with many SSL connections. When the size 
of an object is large, the major computation overhead 
shifts to symmetric cipher execution and MAC calcula- 
tion. 


2.2 Cryptographic Operations 


TLS_RSA_WITH_AES_128_CBC_SHA uses RSA, AES, 
and a Secure Hash Algorithm (SHA) based HMAC. Be- 
low we sketch out each cryptographic operation. 


2.2.1 RSA 


RSA [53] is an asymmetric cipher algorithm widely used 
for signing and encryption. To encrypt, a plaintext mes- 
sage is first transformed into an integer M, then turned 
into a ciphertext C with: 


$$C := M^e \bmod n \tag{1}$$


with a public key (n, e). Decryption with a private key 
(n, d) can be done with 


$$M := C^d \bmod n \tag{2}$$


C, M, d, and n are k-bit large integers, typically 1,024, 
2,048, or even 4,096 bits (or roughly 300, 600, or 1,200 
decimal digits). Since e is chosen to be a small number 
(common choices are 3, 17, and 65,537), public key en- 
cryption is 20 to 60 times faster than private key decryp- 
tion. RSA operations are compute-intensive, especially 
for SSL servers. Because servers perform expensive pri- 
vate key decryption for each SSL connection, handling 
many concurrent connections from clients is a challenge. 
In this paper we focus on private key RSA decryption, 
the main computation bottleneck on the server side. 
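
Equations (1) and (2) can be exercised with a toy key. The sketch below is illustrative only: the tiny primes, the helper names `rsa_encrypt` and `rsa_decrypt`, and the absence of padding are assumptions made for brevity and are not taken from the paper's implementation.

```python
# Toy RSA with textbook-sized numbers (illustration only, never use in practice).
p, q = 61, 53                  # hypothetical small primes
n = p * q                      # modulus
e = 17                         # small public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)            # private exponent (needs Python 3.8+)

def rsa_encrypt(m: int) -> int:
    """Equation (1): C := M^e mod n."""
    return pow(m, e, n)

def rsa_decrypt(c: int) -> int:
    """Equation (2): M := C^d mod n."""
    return pow(c, d, n)

message = 65
ciphertext = rsa_encrypt(message)
recovered = rsa_decrypt(ciphertext)
```

Because e is short and d is about as long as n, the decryption call performs far more modular multiplications than the encryption call, which is the asymmetry the text describes.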


2.2.2 AES 


Advanced Encryption Standard (AES) [32] is a popular 
symmetric block cipher algorithm in SSL. AES divides 
plaintext message into 128-bit fixed blocks and encrypts 
each block into ciphertext with a 128, 192, or 256-bit 
key. The encryption algorithm consists of 10, 12, or 
14 rounds of transformations depending on the key size. 
Each round uses a different round key generated from the 
original key using Rijndael’s key schedule. 

We implement AES encryption and decryption in 
cipher-block chaining (CBC) mode. In CBC mode, each 
plaintext block is XORed with a random block of the 
same size before encryption. The i-th block’s random 
block is simply the (i — 1)-th ciphertext block, and the 
initial random block, called the Initialization Vector (IV),
is randomly generated and is sent in plaintext along with
the encrypted message for decryption.
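
The chaining rule can be sketched as follows. To stay self-contained, the block cipher here is a deliberately trivial, insecure stand-in (`toy_encrypt_block`), not AES; all function names are hypothetical. The point is only the data dependency: encryption of block i needs ciphertext block i − 1, whereas decryption of each block needs only ciphertext that is already available.

```python
import os

BLOCK = 16  # AES block size in bytes

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    """Stand-in for AES block encryption: invertible but NOT secure."""
    return bytes((b + k) % 256 for b, k in zip(block, key))

def toy_decrypt_block(block: bytes, key: bytes) -> bytes:
    """Inverse of toy_encrypt_block."""
    return bytes((b - k) % 256 for b, k in zip(block, key))

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    """Serial by construction: block i is XORed with ciphertext block i-1."""
    assert len(plaintext) % BLOCK == 0
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(xor(plaintext[i:i + BLOCK], prev), key)
        out.append(prev)
    return b"".join(out)

def cbc_decrypt(ciphertext: bytes, key: bytes, iv: bytes) -> bytes:
    """Every block uses only known ciphertext, so blocks are independent."""
    blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
    prevs = [iv] + blocks[:-1]
    return b"".join(xor(toy_decrypt_block(c, key), p)
                    for c, p in zip(blocks, prevs))

key = os.urandom(BLOCK)
iv = os.urandom(BLOCK)
plaintext = b"sixteen byte msg" * 3
ciphertext = cbc_encrypt(plaintext, key, iv)
```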


2.2.3 HMAC 


Hash-based Message Authentication Code (HMAC) is 
used for message integrity and authentication. HMAC 
is defined as 


$$\mathrm{HMAC}(k, m) = H\big((k \oplus \mathit{opad}) \,\|\, H((k \oplus \mathit{ipad}) \,\|\, m)\big) \tag{3}$$


H is a hash function, k is a key, m is a message, and ipad
and opad are predefined constants. Any hash function 
can be combined with HMAC and we use SHA-1 as it is 
the most popular. 
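
Equation (3) translates almost directly into code. The sketch below follows the standard HMAC construction (RFC 2104) with SHA-1: the key is zero-padded to the 64-byte hash block size (or hashed first if longer), and ipad and opad are the bytes 0x36 and 0x5C repeated. The function name `hmac_sha1` is a hypothetical helper; the result can be compared against Python's own `hmac` module.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-1 input block size in bytes

def hmac_sha1(key: bytes, message: bytes) -> bytes:
    """HMAC(k, m) = H((k xor opad) || H((k xor ipad) || m)) with H = SHA-1."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha1(key).digest()      # long keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")      # then zero-padded to one block
    k_ipad = bytes(b ^ 0x36 for b in key)
    k_opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha1(k_ipad + message).digest()
    return hashlib.sha1(k_opad + inner).digest()

msg = b"The quick brown fox jumps over the lazy dog"
tag = hmac_sha1(b"key", msg)
reference = hmac.new(b"key", msg, hashlib.sha1).digest()
```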


2.3 GPU 


Modern GPUs have hundreds of processing cores that 
can be used for general-purpose computing beyond 
graphics rendering. Both NVIDIA and AMD provide 
convenient programming libraries to use their GPUs for 
computation or memory-intensive applications. We use 
NVIDIA GPUs here, but our techniques are applicable to 
AMD GPUs as well. 

A GPU executes code in the SIMD fashion that shares 
the same code path working on multiple data at the same 
time. For this reason, a GPU is ideal for parallel appli- 
cations requiring high memory bandwidth to access dif- 
ferent sets of data. The code that the GPU executes is 
called a kernel. To make full use of massive cores in a 
GPU, many threads are launched and run concurrently 
to execute the kernel code. This means more parallelism 
generally produces better utilization of GPU resources. 

GPU kernel execution takes the following four steps: 
(i) the DMA controller transfers input data from host 
memory to GPU (device) memory; (ii) a host program 
instructs the GPU to launch the kernel; (iii) the GPU ex- 
ecutes threads in parallel; and (iv) the DMA controller 
transfers the result data back to host memory from de- 
vice memory. 

The latest NVIDIA GPU is the GTX580, codenamed
Fermi [20]. It has 512 cores consisting of 16 Stream- 
ing Multiprocessors (SMs), each of which has 32 Stream 
Processors (SPs or CUDA cores). In each SM, 48 KB 
shared memory (scratchpad RAM), 16 KB L1 cache,
and 32,768 4-byte registers allow high-performance pro- 
cessing. To hide the hardware details, NVIDIA provides 
Compute Unified Device Architecture (CUDA) libraries 
to software programmers. CUDA libraries allow easy 
programming for general-purpose applications. More 
details about the architecture can be found in [47,48]. 

The fundamental difference between CPUs and GPUs 
comes from how transistors are composed in the pro- 
cessor. A GPU devotes most of its die area to a large 
array of Arithmetic Logic Units (ALUs). In contrast, 
most CPU resources serve a large cache hierarchy and 
a control plane for sophisticated acceleration of a single
thread (e.g., out-of-order execution, speculative loads,
and branch prediction), which are not very effective
for cryptography. The key insight of this work is that
compute-intensive cryptographic operations can benefit 
from the abundant ALUs in a GPU, given enough paral- 
lelism (intra- and inter-flow). 


3 Optimizing RSA for GPU 


For RSA implementation on GPUs, the main challenge 
is to achieve high throughput while keeping the la- 
tency low. Naive porting of CPU algorithms to a GPU 
would cause severe performance degradation, wasting 
most GPU computational resources. Since a single GPU 
thread runs at 10x to 100x slower speed than a CPU 
thread, the naive approach would yield unacceptable la- 
tency. 

In this section, we describe our approach and design 
choices to maximize performance of RSA decryption on 
GPUs. The key point in maximizing RSA performance 
lies in high parallelism. We exploit parallelism in the 
message level, in modular exponentiation, and finally in 
the word-size modular multiplication. We show that our 
parallel Multi-Precision (MP) algorithm obtains a signif- 
icant gain in throughput and curbs latency increase to a 
reasonable level. 


3.1 How to Parallelize RSA Operations? 


Our main parallelization idea is to batch multiple RSA 
ciphertext messages and to split those messages into 
thousands of threads so that we can keep all GPU cores 
busy. Below we give a brief description of each level. 


Independent Messages: At the coarsest level, we pro- 
cess multiple messages in parallel. Each message is in- 
herently independent of other messages; no coordination 
between threads belonging to different messages is re- 
quired. 


Chinese Remainder Theorem (CRT): For each mes- 
sage, (2) can be broken into two independent modular 
exponentiations with CRT [51]. 


$$M_1 = C^{d \bmod (p-1)} \bmod p \tag{4a}$$
$$M_2 = C^{d \bmod (q-1)} \bmod q \tag{4b}$$


where p and q are k/2-bit prime numbers chosen in pri-
vate key generation (n = p × q). All four parameters, p,
q, d mod (p − 1), and d mod (q − 1), are part of the RSA
private key [38].

With CRT, we perform the two k/2-bit modular expo-
nentiations in parallel, each of which requires roughly
8 times less computation than a k-bit modular exponenti-
ation. Obtaining M from M_1 and M_2 adds only a small
overhead compared to the gain from the two k/2-bit mod-
ular exponentiations.
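
A small numeric check of (4a) and (4b), using a toy key whose values are assumptions chosen for readability. Each half-size exponentiation yields the plaintext reduced modulo one prime. The factor of eight is consistent with schoolbook arithmetic: halving the operand length quarters the cost of each multiplication, and halving the exponent length halves the number of multiplications.

```python
# Toy key (hypothetical values for illustration only).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # needs Python 3.8+

# Reduced exponents stored with the private key.
d_p = d % (p - 1)
d_q = d % (q - 1)

def crt_half_exponentiations(c):
    """Return (M_1, M_2); the two calls are independent of each other."""
    m1 = pow(c, d_p, p)    # equation (4a)
    m2 = pow(c, d_q, q)    # equation (4b)
    return m1, m2

c = pow(65, e, n)                    # ciphertext of the message 65
m1, m2 = crt_half_exponentiations(c)
m = pow(c, d, n)                     # full-size exponentiation, equation (2)
```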


Large Integer Arithmetic: Since the word size of a 
computer is usually 32 or 64-bit, large integers must 
be broken into small multiple words. We can run mul- 
tiple threads, each of which processes a word. How- 
ever, we need carry-borrow processing or base extension 
in order to coordinate the outcome of per-word opera- 
tions between threads. We consider two algorithms, stan- 
dard Multi-Precision (MP) and Residue Number System 
(RNS), to represent and compute large integers. These 
algorithms are commonly used in software and hardware 
implementations of RSA. 


3.2 Optimization Strategies 


In our MP implementation we exploit the following two 
optimization strategies: (i) reducing the number of mod-
ular multiplications with the Constant Length Nonzero 
Windows (CLNW) partitioning algorithm; (ii) adopting 
Montgomery’s reduction algorithm to improve the effi- 
ciency of each modular multiplication routine performed 
at each step of the exponentiation. These optimization 
techniques are also helpful for both serial software and 
hardware implementations, as well as for our GPU par- 
allel implementations. 


CLNW: With the binary square-and-multiply method, 
the expected number of modular multiplications is 3k/2 
for k-bit modular exponentiation [41]. For example, the 
expected number of operations for 512-bit modular ex- 
ponentiation (used for 1024-bit RSA with CRT) is 768. 
The number can be reduced with sliding window tech- 
niques that scan multiple bits, instead of individual bits 
of the exponent. 

We have implemented CLNW and reduced the number 
of modular multiplications from 768 to 607, achieving a 
21% improvement [28]. One may instead use the Vari- 
able Length Nonzero Window (VLNW) algorithm [26], 
but it is known that VLNW does not give any perfor- 
mance advantage over CLNW on average [50]. 
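
The effect of windowing on the multiplication count can be sketched as below. This is a generic sliding-window exponentiation in the spirit of CLNW, not the paper's code: the exponent is scanned from the least significant bit into zero windows and nonzero windows of up to d bits whose lowest bit is 1, and the window length d = 5 is an arbitrary assumption. Both routines count every modular multiplication and squaring, including precomputation.

```python
import random

def binary_modexp(base, exp, mod):
    """Left-to-right square-and-multiply; returns (result, multiplications)."""
    result, count = 1, 0
    for bit in bin(exp)[2:]:
        result = result * result % mod
        count += 1
        if bit == "1":
            result = result * base % mod
            count += 1
    return result, count

def clnw_windows(exp, d):
    """Split exp into (value, length) windows, least significant first."""
    windows = []
    while exp:
        if (exp & 1) == 0:
            length = 0
            while (exp & 1) == 0:       # zero window of arbitrary length
                exp >>= 1
                length += 1
            windows.append((0, length))
        else:                           # nonzero window: low bit is 1
            windows.append((exp & ((1 << d) - 1), d))
            exp >>= d
    return windows

def clnw_modexp(base, exp, mod, d=5):
    """Windowed exponentiation; returns (result, multiplications)."""
    assert exp > 0
    base_sq = base * base % mod
    count = 1
    odd_powers = {1: base % mod}        # base^1, base^3, ..., base^(2^d - 1)
    for v in range(3, 1 << d, 2):
        odd_powers[v] = odd_powers[v - 2] * base_sq % mod
        count += 1
    windows = clnw_windows(exp, d)
    result = odd_powers[windows[-1][0]]  # top window is always nonzero
    for value, length in reversed(windows[:-1]):
        for _ in range(length):
            result = result * result % mod
            count += 1
        if value:
            result = result * odd_powers[value] % mod
            count += 1
    return result, count

rng = random.Random(1)
mod = rng.getrandbits(512) | (1 << 511) | 1
exp = rng.getrandbits(512) | (1 << 511)
base = rng.getrandbits(512) % mod
r_bin, n_bin = binary_modexp(base, exp, mod)
r_win, n_win = clnw_modexp(base, exp, mod)
```

For a random 512-bit exponent, `n_bin` lands near the 3k/2 estimate, and `n_win` is noticeably smaller; the exact counts depend on the exponent and on d.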


Montgomery Reduction: In a modular multiplication 
$c = a \cdot b \bmod n$, an explicit k-bit modulo operation fol-
lowing a naive multiplication should be avoided. Mod- 
ulo operation requires a trial division by modulus n for 
the quotient, in order to compute the remainder. Divi- 
sion by a large divisor is very expensive in both software 
and hardware implementations and is not easily paral- 
lelizable, and thus inappropriate especially for GPUs. 

Montgomery’s algorithm allows a modular multiplica-
tion without a trial division [45]. Let

$$\bar{a} = a \cdot R \bmod n \tag{5}$$

be the montgomeritized form of a modulo n, where R and
n are coprime and n < R. Montgomery multiplication
is defined as in Algorithm 1. If we set R to be $2^k$, the
division and modulo operations with R can be done very
efficiently with bit shifting and masking.


Algorithm 1 MMUL: Montgomery multiplication

Input: $\bar{a}$, $\bar{b}$
Output: $\bar{a} \cdot \bar{b} \cdot R^{-1} \bmod n$
Precomputation: $R^{-1}$ such that $R \cdot R^{-1} \equiv 1 \pmod{n}$;
    $n'$ such that $R \cdot R^{-1} - n \cdot n' = 1$
1: $T \leftarrow \bar{a} \cdot \bar{b}$
2: $M \leftarrow T \cdot n' \bmod R$
3: $U \leftarrow (T + M \cdot n) / R$
4: if $U \geq n$ then
5:     return $U - n$
6: else
7:     return $U$
8: end if

Note that the result of Montgomery multiplication of
$\bar{a}$ and $\bar{b}$ is still $\bar{a} \cdot \bar{b} \cdot R^{-1} \bmod n = \overline{a \cdot b}$, the mont-
gomeritized form of $a \cdot b$. For a modular exponentiation,
we convert a ciphertext C into $\bar{C}$, get $\overline{C^d}$ with successive
Montgomery multiplication operations, and invert it into
$C^d \bmod n$. In this process, expensive divisions or mod-
ulo operations with n are eliminated.
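
Algorithm 1 and the conversion into and out of montgomeritized form can be sketched with ordinary Python integers. This is a serial illustration under stated assumptions, not the GPU implementation: the function names are hypothetical, R is taken as 2^k, and the exponentiation uses plain square-and-multiply rather than CLNW.

```python
def montgomery_setup(n: int, k: int):
    """Precompute R = 2^k, R^{-1} mod n, and n' with R*R^{-1} - n*n' = 1."""
    assert n % 2 == 1 and n < (1 << k)      # odd n is coprime to R = 2^k
    R = 1 << k
    r_inv = pow(R, -1, n)                   # needs Python 3.8+
    n_prime = (R * r_inv - 1) // n          # exact division by construction
    return R, r_inv, n_prime

def mmul(a_bar: int, b_bar: int, n: int, k: int, n_prime: int) -> int:
    """Algorithm 1: a_bar * b_bar * R^{-1} mod n, with no division by n."""
    mask = (1 << k) - 1
    t = a_bar * b_bar                       # line 1
    m = ((t & mask) * n_prime) & mask       # line 2: mod R is a bit mask
    u = (t + m * n) >> k                    # line 3: division by R is a shift
    return u - n if u >= n else u           # lines 4-8

def mont_modexp(c: int, d: int, n: int, k: int) -> int:
    """C^d mod n via repeated Montgomery multiplications."""
    R, _, n_prime = montgomery_setup(n, k)
    c_bar = c * R % n                       # one-time conversion of C
    acc = R % n                             # montgomeritized form of 1
    for bit in bin(d)[2:]:
        acc = mmul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mmul(acc, c_bar, n, k, n_prime)
    return mmul(acc, 1, n, k, n_prime)      # multiplying by 1 strips R

n, k = 3233, 12                             # toy modulus, R = 4096 > n
R, r_inv, n_prime = montgomery_setup(n, k)
product_bar = mmul(1234 * R % n, 2345 * R % n, n, k, n_prime)
plain = mont_modexp(2790, 2753, n, k)
```

Only the initial conversion `c * R % n` reduces modulo n directly; every multiplication inside the loop uses masks and shifts.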

The implementation of Montgomery multiplication 
depends on data structures used to represent large inte- 
gers. Below we introduce our MP implementation. 


3.3 MP implementation


The standard Multi-Precision algorithm is the most 
convenient way to represent large integers in a com-
puter [41]. A k-bit integer A is broken into $s = \lceil k/w \rceil$
words $a_i$, where $i = 1, \ldots, s$ and w is typically set to
the bit-length of a machine word (e.g., 32 or 64). Here we 
describe our MP implementation of Montgomery multi- 
plication and various optimization techniques. 


3.3.1 Multiplication 


In Algorithm 1, the multiplication of two s-word integers
appears three times, in lines 1, 2, and 3. The time complex-
ity of the serial multiplication algorithm that performs a
shift-and-add of partial products (also known as schoolbook
multiplication) is $O(s^2)$. Implementation of
an O(s) parallel algorithm with linear scalability is not 
trivial due to carry processing. We have implemented 
an O(s) parallel algorithm on s processors (threads) that 
works in two phases. In Figure 2, hiword and loword are 
high and low w bits of a 2w-bit product respectively, and 
gray cells represent updated words by s threads. This 
parallelization scheme is commonly used for hardware 
implementation. 

In the first phase, we accumulate s × 1 partial products
in 2s steps (s steps for each loword and hiword), ignoring
any carries. Carries are accumulated in a separate array
through the processing. Each step is simple enough to be
translated into a small number of GPU instructions since
it does not involve cascading carry propagation.


Figure 2: Parallel multiplication example of 649 x 627 =
406,923. For simplicity, a word holds a decimal digit rather
than w-bit binary in the example.

The second phase repeatedly adds the carries to the 
intermediate result and renews the carries. This phase 
stops when all carries become 0, which can be checked
in one instruction with the __any() voting function in
CUDA [48]. The number of iterations is s − 1 in the
worst case, but for most cases it takes one or two iter-
ations since small carries (less than 2s) rarely produce
additional carries.
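
The two phases can be simulated sequentially. In the sketch below, which is an assumption-laden illustration rather than the CUDA kernel, each inner `for i` loop stands for s threads that update different words in the same step, and the `while any(carry)` test plays the role of the voting function. The function names and the explicit `carry` array layout are hypothetical.

```python
import random

def parallel_multiply(a, b, base):
    """Multiply two little-endian word lists of equal length s (2s words out)."""
    s = len(a)
    assert len(b) == s
    result = [0] * (2 * s)
    carry = [0] * (2 * s + 1)          # carry[k]: deferred carry into word k

    # Phase 1: accumulate lowords and hiwords, deferring every carry.
    for j in range(s):
        for i in range(s):             # one step: lowords of a[i] * b[j]
            total = result[i + j] + (a[i] * b[j]) % base
            result[i + j] = total % base
            carry[i + j + 1] += total // base
        for i in range(s):             # one step: hiwords of a[i] * b[j]
            total = result[i + j + 1] + (a[i] * b[j]) // base
            result[i + j + 1] = total % base
            carry[i + j + 2] += total // base

    # Phase 2: fold deferred carries in until none remain.
    while any(carry):
        next_carry = [0] * (2 * s + 1)
        for k in range(2 * s):
            total = result[k] + carry[k]
            result[k] = total % base
            next_carry[k + 1] += total // base
        carry = next_carry
    return result

def to_int(words, base):
    """Convert a little-endian word list back to a Python integer."""
    return sum(w * base ** k for k, w in enumerate(words))

# The example of Figure 2: one decimal digit per word, least significant first.
digits = parallel_multiply([9, 4, 6], [7, 2, 6], base=10)

# The same routine with 32-bit words, as for a 512-bit operand (s = 16).
rng = random.Random(0)
WORD = 1 << 32
x = [rng.randrange(WORD) for _ in range(16)]
y = [rng.randrange(WORD) for _ in range(16)]
xy = parallel_multiply(x, y, WORD)
```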

Our simple O(s) algorithm is a significant improve-
ment over the prior RSA implementation on GPUs. Har-
rison and Waldron parallelize s × s multiplications as fol-
lows [37]: each of s threads independently performs
s × 1 multiplications in serial. Then the s partial products are
summed up in an additive reduction in log s steps, each of
which is done in serial as well. The resulting time com-
plexity is O(s log s), and most of the threads are under-
utilized during the final reduction phase.

We also implemented RNS-based Montgomery multi- 
plications. We adopt Kawamura’s algorithm [40]. Even 
with extensive optimizations, the RNS implementation 
performs significantly slower than MP, and we use only 
the MP version in this paper. For future reference, we 
point out two main problems that we have encountered. 
First, CUDA does not support native integer division and 
modulo operations, on which the RNS Montgomery mul- 
tiplication heavily depends. We have found that the per-
formance of emulated operations is dependent on the size
of a divisor and degrades significantly if the length of a 
divisor is longer than 14 bits. Second, since the num- 
ber of threads is not a power of two, warps are not fully 
utilized and array index calculation becomes slow. 
Figure 3: 1024-bit RSA performance with various optimization techniques. Sub-bars are placed in the same order as the techniques
shown in Section 3.3.2, except for CLNW.


3.3.2 Optimizations 


On top of CRT parallelization, CLNW, Montgomery 
reduction, modular exponentiation, and square-and- 
multiply optimization techniques, we conduct further op- 
timizations as below. Figure 3 demonstrates how the 
overall throughput of the system increases as each op- 
timization technique is applied. The naive implementa- 
tion includes CRT parallelization, basic implementation 
of Montgomery multiplication, and square-and-multiply 
modular exponentiation. For a 1024-bit ciphertext mes- 
sage with CRT, each of two 512-bit numbers (a cipher- 
text message) spans across 16 threads, each of which 
holds a 32-bit word, and those 16 threads are grouped 
as a CUDA block. 


(1) Faster Calculation of M · n: In Algorithm 1, the cal-
culation of M and M · n requires two s × s multiplication
operations. We reduce these into one s × 1 and one s × s
multiplication and interleave them in a single loop. This
technique was originally introduced in [45], and we ap- 
ply it for the parallel implementation. 


(2) Interleaving T + M · n: We interleave the calculation
of T + M · n in a single multiplication loop. This opti-
mization effectively reduces the overhead of loop con- 
struction and carry processing. This technique was used 
in the serial RSA implementation on a Digital Signal 
Processor (DSP) [34], and we parallelize it. 


(3) Warp Utilization: In CUDA, a warp (a group of 32 
threads in a CUDA block), is the basic unit of schedul- 
ing. Having only 16 threads in a block causes under- 
utilization of warps, limiting the performance. We avoid 
this behavior by having blocks be responsible for multi- 
ple ciphertext messages, for full utilization of warps. 


(4) Loop Unrolling: We unrolled the loop in Mont- 
gomery multiplication, by using the #pragma unroll 
feature supported in CUDA. Giving more optimization 
chances to the compiler is more beneficial than in CPU 
programming, due to the lack of out-of-order execution 
capability in GPU cores. 


(5) Elimination of Divergency: Since threads in a warp 
execute the same instruction in lockstep, code-path di- 
vergency in a warp is expensive (all divergent paths must 
be taken in serial). For example, we minimize the diver- 
gency in our code by replacing if statements with flat 
arithmetic operations. 
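
As an illustration of the idea, the conditional subtraction at the end of Algorithm 1 can be written without a branch. The paper does not say which branches were rewritten, so this particular choice, and both function names, are assumptions; Python has no warps, so the sketch only shows the arithmetic form that lets every thread execute the same instruction sequence.

```python
def cond_subtract_branchy(u: int, n: int) -> int:
    """Two code paths: threads that disagree on the test would diverge."""
    if u >= n:
        return u - n
    return u

def cond_subtract_flat(u: int, n: int) -> int:
    """One code path: the predicate (0 or 1) scales the subtrahend."""
    ge = int(u >= n)
    return u - ge * n

samples = [(0, 7), (6, 7), (7, 7), (13, 7)]
flat = [cond_subtract_flat(u, n) for u, n in samples]
branchy = [cond_subtract_branchy(u, n) for u, n in samples]
```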
(6) Use of 64-bit Words: The native support for inte-
ger multiplication on GPUs, which is the basic building
block of large integer arithmetic, has recently been added 
and is still evolving. GTX580 supports native single- 
cycle instructions that calculate hiword or loword of the 
product of two 32-bit integers. 

Use of 64-bit words instead of 32-bit words introduces a new
trade-off on GPUs. While the multiplication of two 64- 
bit words takes four GPU cycles [48], it can halve the 
required number of threads and loop iterations depicted 
in Figure 2. We find that this optimization is highly ef- 
fective when applied. 


(7) Avoiding Bank Conflicts: The 64-bit access pattern 
to the intermediate results and carries in Figure 2 causes 
bank conflicts in shared memory between independent 
ciphertext messages in the same warp. We avoid this 
bank conflict by padding the arrays to adjust access pat- 
tern in shared memory. 


(8) Instruction-Level Optimization: We have manually 
inspected and optimized the core code (about 10 lines) 
inside the multiplication loop, which is the most time- 
consuming part in our GPU code. We changed the code 
order at the CUDA C source level, until we got the de- 
sired assembly code. This includes the elimination of re- 
dundant instructions and pipeline stalls caused by Read- 
After-Write (RAW) register dependencies [47]. 


(9) Post-Exponentiation Offloading: Fusion of two par- 
tial modular exponentiation results from (4) is done on 
the CPU with the Mixed-Radix Conversion (MRC) algo- 
rithm as follows [27]: 


$$M := M_2 + [(M_1 - M_2) \cdot (q^{-1} \bmod p)] \cdot q \tag{6}$$


Although this processing is much lighter than modular 
exponentiation operations, the relative cost has become 
significant as we optimize the modular exponentiation 
process extensively. We have offloaded the above equa- 
tion to the GPU, parallelizing at the message level. We 
also offload other miscellaneous processing in decryp- 
tion such as integer-to-octet-string conversion and PKCS 
#1 depadding [38]. 
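
The recombination in (6) is a handful of integer operations per message. In the sketch below the toy key values and the name `mrc_combine` are assumptions; the bracketed term is additionally reduced modulo p, the usual step in Garner-style recombination, so that the result falls in the range [0, n).

```python
# Toy key (hypothetical values for illustration only).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # needs Python 3.8+
q_inv = pow(q, -1, p)               # q^{-1} mod p, precomputed with the key

def mrc_combine(m1: int, m2: int) -> int:
    """Equation (6): rebuild M from M_1 = M mod p and M_2 = M mod q."""
    h = ((m1 - m2) * q_inv) % p     # reduced mod p, an assumption noted above
    return m2 + h * q

c = pow(65, e, n)                   # ciphertext of the message 65
m1 = pow(c, d % (p - 1), p)         # equation (4a)
m2 = pow(c, d % (q - 1), q)         # equation (4b)
m = mrc_combine(m1, m2)
```

Since each message is recombined independently, one thread per message suffices, which matches the message-level parallelization described above.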


3.4 RSA Microbenchmarks 


We compare our parallel RSA implementation to a se-
rial CPU implementation. We use Intel Integrated Per-
formance Primitives (IPP) [8] as a CPU counterpart. IPP
is the fastest implementation we have tried, outperform-
ing other publicly available libraries for all key sizes. It
performs 3,301 RSA decryption operations/s for a 1024-
bit key on a 2.66 GHz CPU core. Since this number is
higher than what Kounavis et al. recently report (2,990
operations/s on a 3.00 GHz CPU core) in [43], we believe
its CPU reference implementation is a fair comparison to
our GPU code.


Figure 4: RSA MP performance on a GTX580. A single core (Xeon X5650 2.66 GHz) is used for CPU performance.


                     Processor     512       1024      2048      4096
Latency (ms)         CPU core      0.07      0.3       2.3       14.9
                     GTX580, MP    1.1       3.8       13.83     52.46
Throughput (ops/s)   CPU core      13,924    3,301     438       67
                     GTX580, MP    906       263       72        19
Peak (ops/s)         CPU core      13,924    3,301     438       67
                     GTX580, MP    322,167   74,732    12,044    1,661

Table 1: RSA performance with various key sizes (key size in bits)

Table 1 summarizes the performance of RSA on the 
CPU (a single 2.66 GHz core) and GTX580. With only 
one ciphertext message per launch, the GPU’s perfor- 
mance shows an order of magnitude worse throughput 
(operations per second) and latency (the execution time). 
Given enough parallelism, however, the GPU produces 
much higher throughput than the CPU. The MP imple- 
mentation on the GTX580 shows 23.1x, 22.6x, 27.5x, 
and 31.7x speedup compared with a single CPU core, for 
512-bit, 1024-bit, 2048-bit, and 4096-bit RSA, respec- 
tively. The performance gains are comparable to what 
we expect from three hexa-core CPUs. 

Figure 4 shows the correlation between latency and 
throughput of our MP implementation. The throughput 
improves as the number of concurrent messages grows, 
but reaches a plateau beyond 512 messages. The latency 
increases very slowly, but grows linearly with the number 
of messages beyond the point where the GPU is fully 
utilized. Even at peak throughput the latency stays below 
7 to 13.7 ms for more than 70,000 operations/s for 1024- 
bit RSA decryption on a GTX580 card. 

Many cipher algorithms, such as DSA [5], Diffie-Hellman
key exchange [33], and Elliptic Curve Cryptography
(ECC) [42], depend on modular exponentiation, as RSA
does. Our optimization techniques presented in Sec-
tion 3 are applicable to those algorithms and can offer an
efficient platform for their GPU implementation.

We summarize our RSA implementation on GPUs. 
First, the parallel RSA implementation on a GPU brings 
significant throughput improvement over a CPU. Second, 
we need many ciphertext messages in a batch for full 
utilization of GPUs with enough parallelism, in order to 
take a performance advantage over CPUs. In Section 5.4
we introduce the concept of asynchronous concurrent 
execution, which allows smaller batch sizes and thus 
shorter queueing and processing latency, while yielding 
even better throughput. Lastly, while the GPU imple- 
mentation shows reasonable latency, it still imposes per- 
ceptible delay for SSL clients. This problem is addressed 
in Section 5.2 with opportunistic offloading, which ex- 
ploits the CPU for low latency when under-loaded and 
offloads to the GPU for high throughput when a suffi- 
cient number of operations are pending. 


4 Accelerating AES and HMAC-SHA1 
4.1 GPU-accelerated AES 


Since CBC mode encryption has a dependency on the 
previous block result, AES encryption in the same flow is 
serialized. On the other hand, decryption can be run con- 
currently as the previous block result is already known 
at decryption time. Therefore, AES-CBC decryption in 
a GPU runs much faster than AES-CBC encryption. 
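
The dependency structure described above can be illustrated with a minimal Python sketch. This is not the GPU kernel: `toy_encrypt_block` and `toy_decrypt_block` are hypothetical stand-ins for the AES block cipher (an invertible XOR-and-rotate that is not secure), and a thread pool stands in for GPU threads. The sketch only shows that CBC encryption must walk the blocks in order, while each decrypted block needs nothing but two ciphertext blocks that are already available.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # AES block size in bytes


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    """Hypothetical stand-in for the AES block cipher (NOT secure)."""
    mixed = xor_bytes(block, key)
    return mixed[1:] + mixed[:1]  # rotate left by one byte


def toy_decrypt_block(block: bytes, key: bytes) -> bytes:
    rotated = block[-1:] + block[:-1]  # rotate right by one byte
    return xor_bytes(rotated, key)


def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> bytes:
    """Serial by construction: block i needs the ciphertext of block i-1."""
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(xor_bytes(plaintext[i:i + BLOCK], prev), key)
        out.append(prev)
    return b"".join(out)


def cbc_decrypt_parallel(ciphertext: bytes, key: bytes, iv: bytes) -> bytes:
    """Every block can be decrypted independently of the others."""
    blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
    prevs = [iv] + blocks[:-1]  # all "previous blocks" are known up front

    def decrypt_one(i: int) -> bytes:
        return xor_bytes(toy_decrypt_block(blocks[i], key), prevs[i])

    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(decrypt_one, range(len(blocks))))
```

The sketch assumes the input length is a multiple of the block size; padding is omitted for brevity.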

We have implemented AES for a GPU with the fol- 
lowing optimizations. As shown in [36], on-chip shared 
memory offers two orders of magnitude faster access 
time than global memory on GPU. To exploit this fact, 
at the beginning of the AES cipher function, each thread 
copies part of the lookup table into shared memory. Ad- 
ditionally, we have chosen to derive the round key at each 
round, instead of using pre-expanded keys from global 
memory. Though it incurs more computation overhead, 
it avoids expensive global memory access and reduces 
the total latency. 

Figure 5: AES and HMAC-SHA1 performance on GTX580. A single core (Xeon X5650 2.66 GHz) is used for CPU performance.


4.2 AES-NI


Intel has recently added the AES instruction set (AES-NI)
to the latest lineup of x86 processors. AES-NI runs
one round of AES encryption or decryption with a single
instruction (AESENC or AESDEC), and its performance
for AES-GCM and AES-CTR is 2.5 to 6 times
faster than a software implementation [7,39]. AES-NI is
especially useful for handling large files since data transfer
overhead between host and device memory quickly
becomes the bottleneck for GPU-accelerated AES. However,
GPU-based symmetric cipher offloading still provides
a benefit if (i) CPUs do not support AES-NI, (ii)
CPUs become the bottleneck handling the network stack
and other server code, or (iii) other cipher functions (such
as RC4 and 3DES) are needed.


4.3 GPU-accelerated HMAC-SHA1


The performance of HMAC-SHA1 depends on SHA1.
Thus, we focus on the SHA1 algorithm. SHA1 takes 512
bits at each round and generates a 20-byte digest. Each
round uses the previous round’s result; thus SHA1 cannot
be run in parallel within a single message. Our SHA1
optimization in a GPU is divided into two parts: (i) reducing
memory access by processing data in the register,
and (ii) reducing required memory footprint to fit in the
GPU registers.

Each round of SHA-1 is divided into four different
steps, and at each step it processes 20 32-bit words; in
total, 80 intermediate 32-bit values are used. A typical
CPU implementation pre-calculates all 80 words before
processing, which requires a 320-byte buffer. However,
the algorithm only depends on the previous 16 words
at any time. We calculate each intermediate word on
demand, thus reducing the memory requirement to 64
bytes, which fits into the registers (see footnote 5).

5 The idea to reduce the memory footprint is from a Web post
in an NVIDIA forum: http://forums.nvidia.com/index.php?showtopic=102349

To avoid global memory allocation, we unroll all loops
and hardcode the buffer access with constant indices.
This way the compiler register-allocates all the necessary
16 words. With this approach we see about 100% performance
improvement over the naive implementation.
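
The 16-word working set can be sketched in Python as follows. This is an illustrative CPU-side sketch, not the GPU kernel: the function name `sha1_rolling` is hypothetical, and the loop is left rolled for readability, whereas the GPU version described above unrolls it so that every buffer index is a compile-time constant. The key line is the in-place update of `w[t & 15]`, which replaces the 80-word (320-byte) message schedule with a 16-word (64-byte) circular buffer.

```python
import hashlib
import struct


def _rotl(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_rolling(message: bytes) -> bytes:
    """SHA-1 that keeps only 16 schedule words per 512-bit block."""
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]

    # Standard SHA-1 padding: 0x80, zero bytes, 64-bit big-endian bit length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", len(message) * 8)

    for off in range(0, len(padded), 64):
        w = list(struct.unpack(">16I", padded[off:off + 64]))  # 64 bytes only
        a, b, c, d, e = h
        for t in range(80):
            if t >= 16:
                # W[t] = rotl1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]);
                # W[t-16] lives in slot t mod 16, which is overwritten in place.
                w[t & 15] = _rotl(
                    w[(t - 3) & 15] ^ w[(t - 8) & 15]
                    ^ w[(t - 14) & 15] ^ w[t & 15], 1)
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            temp = (_rotl(a, 5) + f + e + k + w[t & 15]) & 0xFFFFFFFF
            a, b, c, d, e = temp, a, _rotl(b, 30), c, d
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]
    return struct.pack(">5I", *h)


# The result can be checked against the standard library implementation.
assert sha1_rolling(b"abc") == hashlib.sha1(b"abc").digest()
```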


4.4 Microbenchmarks 


Figure 5 compares the throughput and latency results of
AES and HMAC-SHA1 with one GTX580 card and one
2.66 GHz CPU core. For the CPU implementations,
we use Intel IPP, which shows the best performance of
AES and SHA-1 as of writing this paper. We fix the
flow length to 16 KB, the largest record size in SSL, 
and vary the number of flows from 32 to 4,096. Our 
AES-CBC implementation shows the peak performance 
of 8.8 Gbps and 10.0 Gbps for encryption and decryption 
respectively when we consider the data transfer cost, but 
the numbers go up to 21.9 Gbps and 33.9 Gbps with- 
out the copy cost. AES-NI shows 5 Gbps and 15 Gbps 
even with a single CPU core and thus one or two cores 
would exceed our GPU performance. Our GPU version 
matches 6.5 and 7.4 CPU cores without AES-NI support 
for encryption and decryption. For HMAC-SHA1, our
GPU implementation shows 31 Gbps with the data transfer
cost and 124 Gbps without, and matches the performance
of 9.4 CPU cores.

Our findings are summarized as follows: (i) AES-NI
shows the best performance per dollar, (ii) the data transfer
cost in GPU reduces the performance by a factor of
3.39 and 4 in AES (decryption) and HMAC-SHA1,
respectively, and (iii) the GPU helps in offloading
HMAC-SHA1 and AES workloads when CPUs do not
support AES-NI. Since a recent hardware trend shows
that the GPU cores are being integrated
into the CPU [16,54], we expect the impact of the data
transfer overhead will decrease in the near future.


5 SSLShader 


We build a scalable SSL reverse proxy, SSLShader, to
incorporate the high-performance cryptographic operations
using a GPU into SSL processing. Though proxying
generally incurs redundant I/O and data copy overheads,
we choose transparent proxying because it provides
the SSL acceleration benefit to existing TCP-based
servers without code modification. SSLShader interacts
directly with the SSL clients while communicating with
the back-end server in plaintext. We assume that the
SSLShader-to-server communication takes place in a secure
environment, but one can encrypt the back-end traffic
with a shared symmetric key in other cases.

Figure 6: Overview of SSLShader

The design goal of SSLShader is twofold. First, the 
performance should scale well with the number of CPU and
GPU cores. Second, SSLShader should curb the latency 
to support interactive environments while improving the 
throughput at high load. In this section we outline the 
key design features of SSLShader. 


5.1 Basic Design 


Figure 6 depicts the overall architecture of SSLShader. 
SSLShader is implemented in event-driven threads. To 
scale with multi-core CPUs, it spawns one worker thread 
per CPU core and each thread accepts and processes 
client-side SSL connections independently. Each con- 
nection is accepted and processed by the same thread to 
avoid cache bouncing between CPU cores. SSLShader 
also creates one GPU-interfacing thread per GPU that 
launches GPU kernel functions to offload cryptographic 
operations. 

Each cryptographic operation type (RSA, AES, 
HMAC-SHA1) has its own request input queue per
worker thread. Cryptographic operations of the same 
type are inserted into the same queue, and are moved 
to a queue in the GPU-interfacing thread when the in- 
put queue size exceeds a certain threshold value. GPU- 
interfacing threads simply offload the requested opera- 
tions in a batch by launching GPU kernels. The results 
are placed back on a per-worker thread output queue so 
that the worker thread can resume the execution of the 
next step in SSL processing. 


5.2 Opportunistic Offloading


In order to fully exploit the parallelism, we should batch
multiple cryptographic operations and offload them to
the GPU. On the GTX580, the peak 1024-bit RSA performance
is achieved when batching 256-512 operations,
that is, handling 256-512 concurrent SSL connections.
While batching generally improves the GPU throughput,
a naive approach of batching a fixed number of operations
would increase processing latency when the load
level is low.

Cryptographic operation | Minimum     | Maximum
RSA (1024-bit)          |             | 512
AES128-CBC Decryption   | 32 (2,048)  | 2,048
AES128-CBC Encryption   | 128 (2,048) | 2,048
HMAC-SHA1               |             | 2,088

Table 2: Thresholds for GPU cryptographic operations per single
kernel launch. Numbers in parentheses are thresholds when
AES-NI is used.

We propose a simple GPU offloading algorithm that 
reduces response latency when lightly loaded and im- 
proves the overall throughput at high load. When a 
worker thread inserts a cryptographic request to an in- 
put queue, it first checks the number of the same type of 
requests in all workers’ queues, and its minimum batch- 
ing threshold (the number of queued requests required 
for GPU offloading). If the number of requests is above 
the threshold, SSLShader moves all the requests in the 
worker thread queue to a GPU-interfacing thread queue. 
The batching thresholds are determined based on the
GPU’s throughput. The minimum threshold is set to the
batch size at which the GPU starts to outperform a single
CPU core, and the maximum is set to the batch size at
which the maximum throughput is achieved. We limit the
maximum batch size since pushing too many requests into
a queue in the GPU-interfacing thread could result in long
delay without throughput improvement. The thresholds
can be drawn automatically
from benchmark tests at configuration time. For AES, 
thresholds are different when AES-NI is enabled. If 
AES-NI is available, we set the minimum threshold to 
be the same as the maximum threshold, hoping to ben- 
efit from extra processing power only when the CPU is 
overloaded. Table 2 shows the thresholds we use with 
the GTX580. 
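
The offloading decision described above can be sketched as follows. This is a minimal single-threaded Python model, not SSLShader's code: the class `Offloader`, its method names, and the way the maximum threshold caps the GPU-side queue are assumptions made for illustration, and we read "above the threshold" as "at least the minimum". The AES thresholds in the usage note are the Table 2 values without AES-NI.

```python
from collections import deque


class Offloader:
    """Hypothetical model of per-worker input queues and one GPU queue."""

    def __init__(self, num_workers, thresholds):
        # thresholds: operation type -> (minimum, maximum) batch size
        self.thresholds = thresholds
        self.worker_queues = [
            {op: deque() for op in thresholds} for _ in range(num_workers)
        ]
        self.gpu_queue = {op: deque() for op in thresholds}

    def pending(self, op):
        """Number of requests of this type across all workers' queues."""
        return sum(len(queues[op]) for queues in self.worker_queues)

    def submit(self, worker, op, request):
        """Enqueue a request; return True if requests moved to the GPU queue."""
        minimum, maximum = self.thresholds[op]
        queue = self.worker_queues[worker][op]
        queue.append(request)
        if self.pending(op) < minimum:
            # Lightly loaded: leave the request for the worker's CPU path.
            return False
        # Enough parallelism: hand this worker's queue to the GPU thread,
        # but never let the GPU-side queue exceed the maximum (assumption).
        room = maximum - len(self.gpu_queue[op])
        moved = 0
        while queue and moved < room:
            self.gpu_queue[op].append(queue.popleft())
            moved += 1
        return moved > 0
```

For example, `Offloader(12, {"AES-DEC": (32, 2048), "AES-ENC": (128, 2048)})` would model twelve worker threads. With AES-NI, setting the minimum equal to the maximum in this model keeps requests on the CPU until 2,048 of them are pending.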

For low latency and high parallelism, the worker 
thread prioritizes I/O events, and processes crypto- 
graphic operations when it has no I/O event. Worker 
threads handle cryptographic operations in the first-in 
first-out (FIFO) manner. We put a timestamp on each 
cryptographic request as it arrives, and use the times- 
tamp to find the earliest arrived operation. The GPU also 
uses FIFO scheduling for processing cryptographic op- 
erations. The GPU-interfacing thread looks at the head 
timestamp of requests by the type, and processes the ear- 
liest request’s type in a batch. Sometimes it takes too 
long for the worker thread to drain the cryptographic 
operations in its queue and this can lead to I/O star- 
vation. To prevent this, we have worker threads peri- 
odically check for I/O events while processing crypto- 
graphic operations. 
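
The FIFO choice made by the GPU-interfacing thread can be sketched as below. This is an illustrative Python model under stated assumptions: the function name `next_gpu_batch`, the `(timestamp, request)` tuple layout, and the per-type `max_batch` cap are hypothetical, and each queue is assumed to hold requests in arrival order so that its head carries the oldest timestamp of that type.

```python
from collections import deque


def next_gpu_batch(queues, max_batch):
    """Pick the operation type whose head request arrived first.

    queues:    operation type -> deque of (timestamp, request)
    max_batch: operation type -> maximum batch size for one launch
    Returns (operation type, list of dequeued items), or (None, []).
    """
    nonempty = {op: q for op, q in queues.items() if q}
    if not nonempty:
        return None, []
    # Compare only the head timestamps, one per operation type.
    op = min(nonempty, key=lambda name: nonempty[name][0][0])
    queue = nonempty[op]
    count = min(max_batch[op], len(queue))
    batch = [queue.popleft() for _ in range(count)]
    return op, batch
```

Batching by type keeps each kernel launch homogeneous, while comparing head timestamps keeps the order of service close to arrival order.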

We also tested priority-based scheduling by having the
CPU prioritize HMAC-SHA1 and AES encryption, and
the GPU prioritize RSA and AES decryption. This strategy
often improves the peak throughput, but we reject
this idea because lower-priority cryptographic operations
could suffer from starvation, and we noticed unstable
throughput and longer latency in many cases.

Figure 7: Performance improvement from asynchronous concurrent execution with 16 streams, independent CUDA contexts of
commands that execute in order asynchronously. Each flow size is 16 KB for (b) and (c).


5.3 NUMA-aware GPU Sharing


NUMA systems are becoming commonplace in server 
machines. In NUMA systems, the communication cost 
between CPU cores varies greatly, depending on the 
number of NUMA hops. For high scalability, it is nec- 
essary to reduce inter-core communication by careful 
NUMA-aware data placement. 

When we use a GPU, we should consider the following
issues: (i) GPUs are known to perform badly when
used by multiple threads simultaneously due to high
context switching overhead [48]; (ii) gathering cryptographic
operations from multiple cores brings more parallelism
and helps to exploit the full GPU capacity; and
(iii) memory access or synchronization across NUMA
nodes is much more expensive than intra-NUMA node
operation. For these reasons, we limit the sharing of
GPUs to the threads in the same NUMA node.

For intra-NUMA node communication we choose 
threads over processes for faster sharing of the queues as 
offloading cryptographic operations requires data move- 
ment between worker and GPU-interfacing threads. For 
inter-NUMA node communication, we choose processes 
for ease of connection handling without kernel lock con- 
tention at socket creation and closure. 


5.4 Asynchronous Concurrent Execution


The most recent CUDA device with Compute Capability 
2.0 provides concurrent GPU kernel execution and data 
copy for better utilization of the GPU. On the GTX580, 
up to sixteen different kernels can run simultaneously 
within a single GPU, and copies from device to host 
and host to device can overlap with each other as well 
as with kernel execution. To benefit from concurrent execution
and copy, SSLShader launches all GPU transactions
asynchronously. With asynchronous concurrent
execution, we see up to 1,399%, 731%, and 890% performance
improvements over synchronous execution in
RSA, AES encryption, and HMAC-SHA1, respectively.
Figure 7 depicts the effect of asynchronous concurrent
execution by varying the batch size. When the batch size
is small, asynchronous concurrent execution improves
performance greatly as idle GPU cores can be better utilized.
But even for a large batch size such as 2,048,
we see 30 to 60% performance improvement in HMAC-SHA1
and AES. The overlap of DMA data copy and
kernel execution improves the performance even when
all cores in the GPU are already utilized. In RSA, the
performance improvement in the batch size of 1024 is
fairly small compared to those of AES or HMAC-SHA1
because the data copy time in RSA is relatively smaller
than the execution time and the GPU is sufficiently utilized
with large batch sizes.

We believe our design and implementation strategies 
in this section are not limited to only SSLShader, and can 
be applied to any applications that want to exploit the 
massive parallelism of GPUs. While none of the tech- 
niques are ground-breaking, their combination brings a 
drastic difference in the utilization of GPUs, latency re- 
duction, and throughput improvement. 


6 Evaluation 


In this section we evaluate the effectiveness of 
SSLShader using HTTPS, the most popular protocol that 
uses SSL. We show that SSLShader achieves high perfor- 
mance in connection handling and large-file data transfer 
with small latency overhead. 


6.1 System Configuration 


Our server platform is equipped with two Intel Xeon
X5650 (hexa-core 2.66 GHz) CPUs, 24 GB memory,
and two NVIDIA GTX580 cards (512 cores, 1.5 GHz,
1.5 GB RAM). We install Ubuntu Linux 10.04, NVIDIA
CUDA Driver v256.40, and Intel ixgbe driver v2.1.4
on the server. As back-end web server software, we run
lighttpd v1.4.28 with 12 worker processes to match
the number of CPU cores. In all experiments, we run
lighttpd and SSLShader on the same machine.

Figure 8: Transactions per second

We compare SSLShader against lighttpd with
OpenSSL. For fair comparison, we patched
OpenSSL 1.0.0 to use IPP 7.0, which
has AES-NI support as well as the latest RSA and
SHA-1 optimizations. We find that IPP 7.0 improves 
the RSA, AES, and HMAC performance by 55%, 10%, 
and 22% respectively from the OpenSSL 1.0.0 default 
implementation. As our goal is to offload SSL compu- 
tation overhead, we focus on static content to prevent 
the back-end web server from becoming a bottleneck. 
To generate HTTP requests, we run the Apache HTTP 
server benchmark tool (ab) [1] on seven 2.66 GHz Intel 
Nehalem quad-core client machines. We modified ab 
to support rate-limiting and to report latency for each 
connection. 


6.2 SSL Handshake Performance 


To evaluate the performance of connection handshake, 
we measure the number of SSL transactions per second 
(TPS) for a small HTTP object (43 bytes including HTTP 
headers). Figure 8 shows the maximum TPS achieved by 
varying the number of concurrent connections. For 1024- 
bit RSA keys, SSLShader achieves 29K TPS, which 
is 2.5 times faster than 11.2K TPS for lighttpd with 
OpenSSL. SSLShader achieves 21.8K TPS, for 2048- 
bit RSA, which is 6 times higher than 3.6K TPS of 
lighttpd. Given that 768-bit RSA was cracked early 
in 2010 [12] and that NIST recommends 2048-bit RSA 
for secure operations as of 2011 [46], the large perfor- 
mance improvement with 2048-bit RSA is significant. 
In SSLShader, the throughput increases as the concurrency
increases because the GPU can exploit more parallelism.
In 2048-bit RSA, 21.8K is close to the peak
throughput of 24.1K msg/s with two GTX580s, meaning
that the GPUs are almost fully utilized. However,
the performance of RSA 1024-bit is much less than the
peak throughput of a single GPU, which implies that the
GPUs have idle computing capacity.

Image Name                                 | CPU Usage (%)
Kernel NIC device driver                   | 2.52
Kernel (including TCP/IP stack)            | 60.35
SSLShader                                  |
libc (memory copy and others)              | 9.88
IPP + libcrypto (cryptographic operations) | 12.89
lighttpd (back-end web server)             | 4.90
others                                     | 4.35

Table 3: CPU usage breakdown using oprofile

We run oprofile to analyze the bottleneck for the 
RSA 1024-bit case with 16,000 concurrent clients. Ta- 
ble 3 summarizes where the CPU cycles are spent. We 
see that more than 60% of CPU time is spent in the ker- 
nel for accepting connections and networking I/O; 13% 
of the CPU cycles are spent for cryptographic operations, 
mostly for the Pseudo Random Function (PRF) used for 
generating session keys from the master secret in the 
handshake step. We chose not to offload PRF to GPUs 
because it is run only once in the handshake step and 
its computation overhead is less than 1/10th of the RSA 
decryption overhead. We conclude that the performance 
bottleneck is in the Linux kernel that does not scale to 
multi-core CPUs for connection acceptance, as is also 
observed in [57]. 


6.3 Response Time Distribution 


Naively using a GPU for cryptographic operations could 
lead to high latency when the load level is low. Oppor- 
tunistic offloading guards against this problem, minimiz- 
ing the latency when the load is light and maximizing 
the throughput when the load is high. To evaluate the ef- 
fectiveness of our opportunistic offloading algorithm, we 
measure the response time for both heavy and light load 
cases. We control the load by rate-limiting the clients. 
For lighttpd, we set the limits to 1K TPS for light 
load and 11K TPS for heavy load. For SSLShader, we 
further increase the heavy load limit to 29K TPS. For 
heavy load experiments, we vary the maximum number 
of clients from 1K to 4K. Clients repeatedly request the 
small HTTP objects as in the handshake experiment. 
Figure 9 shows the cumulative distribution functions 
(CDFs) of response times. When the load level is low, 
both lighttpd and SSLShader handle most of the con- 
nections in a few milliseconds (ms), which shows that the 
opportunistic offloading algorithm intentionally uses the 
CPU to benefit from its low latency. SSLShader shows 
a slight increase in response time (2 ms vs. 3 ms on 
median) due to the proxying overhead. At heavy load
with 1K concurrent connections, SSLShader’s latency is
lower than that of lighttpd because CPUs are overloaded
and lighttpd produces longer response times.
In contrast, SSLShader reduces the CPU overhead by
offloading the majority of cryptographic operations to
GPUs. SSLShader shows 39 ms and 64 ms for the 50th and
99th percentiles while lighttpd shows 76 ms and 260
ms each even at the much lower TPS load level. Even
if we increase the load with 4K concurrent clients, 70%
of SSLShader response times remain similar to those of
lighttpd with 1K clients at the 11K TPS level.

Figure 10: Bulk transfer throughput


6.4 Bulk Data Encryption Performance 


We measure bulk data encryption performance by vary- 
ing the file size from 4 KB to 64 MB with and with- 
out AES-NI support, and show the results in Figure 10. 
When AES-NI is enabled, the SSLShader throughput 
peaks at 13 Gbps while lighttpd peaks at 16.0 Gbps. 
We note that increasing the content size above 64 MB 
does not increase lighttpd’s throughput. For contents 
smaller than 4 MB, SSLShader performs 1.3 to 2.2x bet- 
ter than lighttpd while lighttpd shows 1.1x to 1.2x 
better performance for contents larger than 4 MB. As the 
content size grows and throughput increases, the proxy- 
ing overhead increases accordingly, and eventually be- 
comes the performance bottleneck. With oprofile, we 
find that 30% of CPU time is spent on data copying, 
and 20% is spent on handling interrupts for SSLShader,
leaving only 50% for use in cryptographic operation and
other processing. Without AES-NI, SSLShader achieves
8.9 Gbps, while lighttpd achieves 9.6 Gbps. The peak
throughput of SSLShader is slightly lower due to the
copying overhead as well.

NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation

Considering typical Web objects and email contents 
are smaller than 100 KB [21, 23], we believe that the 
performance gain in small content size and the benefit 
of transparent proxying outweigh the small performance 
loss in large files in many real-world scenarios. Also, 
the GPU is starting to be integrated into the CPU as in 
AMD’s Fusion [16], and we expect that such technology 
will mitigate the performance problem by eliminating the 
data transfer between GPU and CPU. 


7 Discussion & Related Work 


SSL Performance: SSL performance analysis and ac- 
celeration have drawn much attention in the context of 
secure Web server performance. Earlier, Apostolopou- 
los et al. analyzed the SSL performance of Web servers 
using the SpecWeb96 benchmark tool [22]. They ob- 
serve that the SSL-enabled Web servers are up to two 
orders of magnitude slower than plaintext Web servers. 
For small HTTP transactions, the main bottleneck lies 
in SSL handshaking while data encryption takes up sig- 
nificant CPU cycles when the content gets larger. Later, 
Coarfa et al. reported similar results and estimated the 
upper bound in the performance improvement with each 
SSL operation optimization [29]. To improve the SSL 
handshake performance, Boneh et al. proposed client- 
side caching of server certificates to reduce the SSL 
round-trip overhead [25]. Another approach is to pro- 
cess multiple RSA decryptions in a batch using Fiat’s 
batch RSA [55]. They report a 2.5x speedup on their Web 
server experiments by batching four RSA operations. 

Recently, Kounavis et al. improve the SSL perfor- 
mance with general-purpose CPUs [43]. They opti- 
mize the schoolbook big number multiplication and ben- 
efit from AES-NI for symmetric cipher. To reduce the 
CPU overhead for MAC algorithms, they use the Galois
Counter Mode (GCM) which combines the AES encryption
with the authentication. In comparison, we argue
that GPUs bring extra computing power in a cost-effective
manner, especially for RSA and HMAC-SHA1.
By parallelizing the schoolbook multiplication and vari- 
ous optimizations, our 1024-bit RSA implementation on 
a GPU shows 30x improvement over their 3.0 GHz CPU 
core. Further, we focus on TLS 1.0 which is widely used 
in practice, whereas GCM is only supported in TLS 1.2, 
which is not popular yet. 


AES Implementations on GPU: Modern GPUs are attractive
for computation-intensive AES operations [30,
31, 36, 44, 49, 58]. Most GPU-based implementations exploit
shared memory and on-demand round key calculation
to reduce the global memory access. However, we
find few references that evaluate the AES performance in 
the CBC mode. Unlike electronic codebook (ECB) mode 
or counter (CTR) mode, the CBC mode is hard to paral- 
lelize but is most widely used. Also, most of them report 
the numbers without data copy overhead, but we find the 
copy overhead severely impairs the AES performance. 

Manavski et al. report 8.28 Gbps AES performance on 
GTX 280 (240 cores, 1.296 GHz) [44], while Osvik et
al. report 30.9 Gbps on half of a GTX 295 (2 x 240
cores, 1.242 GHz) [49]. Both of them use the ECB mode 
without data copy overhead. In the same setting, our im- 
plementation shows 32.8 Gbps on GTX 285 (240 cores, 
1.476 GHz). Direct comparison is hard due to differ- 
ent GPUs, but our number is comparable to these results 
(3.48x that of Manavski’s, 0.89x that of Osvik’s) by the 
cycles per byte metric. 
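
The normalization behind these two ratios can be reproduced with a short calculation. The sketch below assumes that the cycles-per-byte comparison amounts to dividing throughput by the product of core count and clock rate, which is our reading of the text rather than a stated formula; the helper name `gbps_per_core_ghz` is hypothetical. The inputs are the figures quoted above, with half of a GTX 295 counted as 240 cores.

```python
def gbps_per_core_ghz(gbps: float, cores: int, clock_ghz: float) -> float:
    """Throughput normalized by the GPU's aggregate cycle rate."""
    return gbps / (cores * clock_ghz)


ours = gbps_per_core_ghz(32.8, 240, 1.476)      # GTX 285
manavski = gbps_per_core_ghz(8.28, 240, 1.296)  # GTX 280
osvik = gbps_per_core_ghz(30.9, 240, 1.242)     # half of a GTX 295

ratio_manavski = ours / manavski
ratio_osvik = ours / osvik
print(f"vs. Manavski: {ratio_manavski:.2f}x, vs. Osvik: {ratio_osvik:.2f}x")
```

Under this assumption the two ratios agree with the 3.48x and 0.89x figures to two decimal places.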


RSA Implementations on GPU: Szerwinski and
Güneysu made the first implementation of RSA on
the general-purpose GPU computation framework [56]. 
They reported two orders of magnitude lower perfor- 
mance than ours, but it should not be directly compared 
because they used a relatively old NVIDIA 8800GTS 
card with a different GPU architecture. 

Harrison and Waldron report on 1024-bit key RSA 
implementation on an NVIDIA GPU [37], and to the 
best of our knowledge theirs is the fastest implemen- 
tation before our work. They compare serial and MP 
parallel approaches in Montgomery multiplication and 
conclude that the parallel implementation shows worse 
performance at scale due to GPU thread synchronization 
overhead. We have run their serial code on an NVIDIA 
GTX580 card, and found that their peak throughput 
reaches 31,220 operations/s at a latency of 131 ms 
with 4,096 messages per batch. Our throughput on the 
same card shows 74,733 operations/s at a latency of 
13.7 ms with 512 messages per batch, 2.3x improvement 
in throughput and 9.6x latency reduction. 


Comparison with H/W Acceleration Cards: Many 
SSL cards support OpenSSL engine API [11] so that 
their hardware can easily accelerate existing software. 
Current hardware accelerators support 7K to 200K RSA 
operations/s with 1024-bit keys [10, 14]. Our GPU im- 
plementation is comparable with these high-end hard- 
ware accelerators, running at up to 92K RSA opera- 
tions/s at much lower cost. Moreover, GPUs are flexible 
for adoption of new cryptographic algorithms. 


Other Protocols for Secure Communication: Bittau et
al. propose tcpcrypt as an extension of TCP for secure
communication [24]. Tcpcrypt is essentially a clean-slate
redesign of SSL that shifts the decryption overhead by
private key to clients and that allows a range of authentication
mechanisms. Their evaluation reports 25x better
connection handling performance when compared with
SSL. Moreover, tcpcrypt provides forward secrecy by 
default while SSL leaves that as an option. While fix- 
ing the SSL protocol is desirable, we focus on improv- 
ing the current practice of SSL in this work. IPsec [17] 
provides secure communication at the IP layer, which is 
widely used for VPN. IPsec can be more easily paral- 
lelized compared to SSL since many packets can be pro- 
cessed in parallel [35]. 


Performance per $ Comparison: In Table 4, we show
the price and relative performance to price for two CPUs,
GTX580, and a popular SSL accelerator card. Intel Xeon
X5650 and GTX580 are the choices for our experiments.
i7 920 has four CPU cores with the same clock speed
as the X5650 without AES-NI support. We choose the
CN1620 (see footnote 8) because it is one of the most
cost-effective accelerators that we have found. Though it
is difficult to compare the performance per dollar directly
(e.g., GPU cannot be used without CPU), we present the
information here to give a sense of the cost effectiveness
of each piece of hardware.





       | Price ($) | RSA (ops/sec/$) | AES-ENC (Mbps/$) | AES-DEC (Mbps/$) | SHA1 (Mbps/$)
X5650  |           |                 |                  |                  | 20.2
i7 920 |           |                 |                  |                  | 46.5
GTX580 |           |                 |                  |                  | 62.3
CN1620 | 2,129     |                 |                  |                  | 2.8


Table 4: Performance per $ comparison (price as of Feb. 2011) 


GTX580 shows the best performance per dollar for
RSA and HMAC-SHA1. For AES operations, X5650 is
the best with its AES-NI capability, and GTX580 shows
a slightly better number compared to i7 920. CN1620
is inefficient in terms of performance per dollar for all
operations. SSL accelerators typically have much better
power efficiency compared to general-purpose processors,
and they are mainly used in high-end network equipment
rather than on server machines.


8 Conclusions


We have enjoyed the security of SSL for over a decade 
and it is high time that we used it for every private In- 
ternet communication. As a cheap way to scale the per- 
formance of SSL, we propose using graphics cards as 
high-performance SSL accelerators. We have presented 
a number of novel techniques to accelerate the crypto- 
graphic operations on GPUs. On top of that, we have 
built SSLShader, an SSL reverse proxy, that opportunis- 
tically offloads cryptographic operations to GPUs and 
achieves high throughput and low response latency. 


8 Model name is CN1620-400-NHB4-E-G and more details are on
http://www.scantec-shop.com/cn1620-400-nhb4-e-g-375.html

Our evaluation shows that we can scale 1024-bit RSA
up to 92K operations/s with a single GPU card by careful
workload pipelining. SSLShader handles 29K SSL
TPS and achieves 13 Gbps bulk encryption throughput 
on commodity hardware. We hope our work pushes SSL 
to a wider adoption than today. 

We report that inefficiency in the Linux TCP/IP stack 
is keeping performance lower than what SSLShader can 
potentially offer. Most of the inefficiency in the Linux 
TCP/IP stack comes from mangled flow affinity and seri- 
alization problems in multi-core systems. We leave these 
issues to future work. 
Abstract


As one of the fundamental infrastructures for cloud 
computing, data center networks (DCN) have recently 
been studied extensively. We currently use pure 
software-based systems, FPGA based platforms, e.g., 
NetFPGA, or OpenFlow switches, to implement and 
evaluate various DCN designs including topology de- 
sign, control plane and routing, and congestion control. 
However, software-based approaches suffer from high 
CPU overhead and processing latency; FPGA based plat- 
forms are difficult to program and incur high cost; and 
OpenFlow focuses on control plane functions at present. 


In this paper, we design a ServerSwitch to address the 
above problems. ServerSwitch is motivated by the ob- 
servation that commodity Ethernet switching chips are 
becoming programmable and that the PCI-E interface 
provides high throughput and low latency between the 
server CPU and I/O subsystem. ServerSwitch uses a 
commodity switching chip for various customized packet 
forwarding, and leverages the server CPU for control and 
data plane packet processing, due to the low latency and 
high throughput between the switching chip and server 
CPU. 


We have built our ServerSwitch at low cost. Our ex- 
periments demonstrate that ServerSwitch is fully pro- 
grammable and achieves high performance. Specifically, 
we have implemented various forwarding schemes in- 
cluding source routing in hardware. Our in-network 
caching experiment showed high throughput and flexi- 
ble data processing. Our QCN (Quantized Congestion 
Notification) implementation further demonstrated that 
ServerSwitch can react to network congestions in 23us. 


*This work was performed when Zhiqiang Zhou was a visiting student
at Microsoft Research Asia.


1 Introduction 


Data centers have been built around the world for var- 
ious cloud computing services. Servers in data centers 
are interconnected using data center networks. A large 
data center network may connect hundreds of thousands 
of servers. Due to the rise of cloud computing, data cen- 
ter networking (DCN) is becoming an important area of 
research. Many aspects of DCN, including topology de- 
sign and routing [15, 5, 13, 11, 22], flow scheduling and 
congestion control [7, 6], virtualization [14], application 
support [26, 4], have been studied. 

Since DCN is a relatively new exploration area, many
of the designs (e.g., [15, 5, 13, 22, 7, 14, 4]) have de- 
parted from the traditional Ethernet/IP/TCP based packet 
format, Internet-based single path routing (e.g., OSPF), 
and TCP style congestion control. For example, Port- 
land performs longest prefix matching (LPM) on destina- 
tion MAC address, BCube advocates source routing, and 
QCN (Quantized Congestion Notification) [7] uses rate- 
based congestion control. Current Ethernet switches and 
IP routers therefore cannot be used to implement these 
designs. 

To implement these designs, rich programmability 
is required. There are approaches that provide this 
programmability: pure software-based [17, 10, 16] or 
FPGA-based systems (e.g., NetFPGA [23]). Software- 
based systems can provide full programmability and as 
recent progress [10, 16] has shown, may provide a rea- 
sonable packet forwarding rate. But their forwarding rate 
is still not comparable to commodity switching ASICs 
(application specific integrated circuit), and the batch 
processing used in their optimizations introduces high 
latency which is critical for various control plane func- 
tions such as signaling and congestion control [13, 22, 7]. 
Furthermore, the packet forwarding logics in DCN (e.g., 
[15, 13, 22, 14]) are generally simple and hence are 
better implemented in silicon for cost and power sav- 
ings. FPGA-based systems are fully programmable. But 


NSDI 11: 8th USENIX Symposium on Networked Systems Design and Implementation 




the programmability is provided by hardware description 
languages such as Verilog, which are not as easy to learn 
and use as higher-level programming languages such as 
C/C++. Furthermore, FPGAs are expensive and are diffi- 
cult to use in large volumes in data center environments. 

In this paper, we design a ServerSwitch platform, 
which provides easy-to-use programmability, low la- 
tency and high throughput, and low cost. ServerSwitch 
is based on two observations as follows. First, we ob- 
serve that commodity switching chips are becoming pro- 
grammable. Though the programmability is not compa- 
rable to general-purpose CPUs, it is powerful enough to 
implement various packet forwarding schemes with dif- 
ferent packet formats. Second, current standard PCI-E 
interface provides microsecond level latency and tens of 
Gb/s throughput between the I/O subsystem and server 
CPU. ServerSwitch is then a commodity server plus a 
commodity, programmable switching chip. These two 
components are connected via the PCI-E interface. 

We have designed and implemented ServerSwitch. We 
have built a ServerSwitch card, which uses a merchan- 
dise gigabit Broadcom switching chip. The card con- 
nects to a commodity server using a PCI-E X4 interface. 
Each ServerSwitch card costs less than $400 when manufactured
in quantities of 100. We have also implemented a
software stack, which manages the card, and provides 
support for control and data plane packet processing. We 
evaluated ServerSwitch using micro benchmarks and real
DCN designs. We built a ServerSwitch based, 16-server 
BCube [13] testbed. We compared the performance of 
software-based packet forwarding and our ServerSwitch 
based forwarding. The results showed that ServerSwitch
achieves high performance and zero CPU overhead for 
packet forwarding. We also implemented a QCN con- 
gestion control [7] using ServerSwitch. The experi- 
ments showed stable queue dynamics and that Server- 
Switch can react to congestion in 23us. 

ServerSwitch explores the design space of combin- 
ing a high performance ASIC switching chip with lim- 
ited programmability with a fully programmable mul- 
ticore commodity server. Our key findings are as fol- 
lows: 1) ServerSwitch shows that various packet for- 
warding schemes including source routing can be of- 
floaded to the ASIC switching chip, hence resulting in 
small forwarding latency and zero CPU overhead. 2) 
With a low latency PCI-E interface, we can implement 
latency sensitive schemes such as QCN congestion con- 
trol, using server CPU with a pure software approach. 
3) The rich programmability and high performance pro- 
vided by ServerSwitch can further enable new DCN ser- 
vices that need in-network data processing such as in- 
network caching [4]. 

The rest of the paper is organized as follows. We elaborate
on the design goals in § 2. We then present the architecture
of ServerSwitch and our design choices in § 3.
We illustrate the software, hardware, and API implementations
in § 4. § 5 discusses how we use ServerSwitch to
implement two real DCN designs, and § 6 evaluates the platform
with micro benchmarks and real DCN implementations.
We discuss ServerSwitch limitations and 10G
ServerSwitch in § 7. Finally, we present related work in
§ 8 and conclude in § 9.


2 Design Goals 


As we have discussed in § 1, the goal of this paper is 
to design and implement a programmable and high per- 
formance DCN platform for existing and future DCN 
designs. Specifically, we have the following design goals.
First, on the data plane, the platform should provide a
packet forwarding engine that is both programmable and
achieves high performance. Second, the platform needs
to support new routing and signaling, flow/congestion 
control designs in the control plane. Third, the platform 
enables new DCN services (e.g., in-network caching) by 
providing advanced in-network packet processing. To 
achieve these design goals, the platform needs to provide 
flexible programmability and high performance in both 
the data and control planes. It is highly desirable that the 
platform be easy to use and implemented in pure com- 
modity and low cost silicon, which will ease the adoption 
of this platform in a real world product environment. We 
elaborate on these goals in detail in what follows. 

Programmable packet forwarding engine. Packet
forwarding is the basic service provided by a switch or
router. Forwarding rate (packets per second, or PPS) is
one of the most important metrics for network device 
evaluation. Current Ethernet switches and IP routers can 
offer line-rate forwarding for various packet sizes. How- 
ever, recent DCN designs require a packet forwarding en- 
gine that goes beyond traditional destination MAC or IP 
address based forwarding. Many new DCN designs em- 
bed network topology information into server addresses 
and leverage this topology information for packet for- 
warding and routing. For example, PortLand [22] codes 
its fat-tree topology information into device MAC ad- 
dresses and uses Longest Prefix Matching (LPM) over its 
PMAC (physical MAC) for packet forwarding. BCube 
uses source routing and introduces an NHI (Next Hop 
Index, §7.1 of [13]) to reduce routing path length by
leveraging BCube structural information. We expect 
that more DCN architectures and topologies will appear 
in the near future. These new designs call for a pro- 
grammable packet forwarding engine which can handle 
various packet forwarding schemes and packet formats. 

New routing and signaling, flow/congestion control
support. Besides the packet forwarding functions in the
data plane, new DCN designs also introduce new control
and signaling protocols in the control plane. For example,
to support the new addressing scheme, switches in
PortLand need to intercept the ARP packets, and redi- 
rect them to a Fabric Manager, which then replies with 
the PMAC of the destination server. BCube uses adap- 
tive routing. When a source server needs to communi- 
cate with a destination server, the source server sends 
probing packets to probe the available bandwidth of mul- 
tiple edge-disjoint paths. It then selects the path with 
the highest available bandwidth. The recently proposed
QCN switches sample the incoming packets and send 
back queue and congestion information to the source 
servers. The source servers then react to the conges- 
tion information by increasing or decreasing the sending 
rate. All these functionalities require the switches to be 
able to filter and process these new control plane mes- 
sages. Control plane signaling is time critical and sen- 
sitive to latency. Hence switches have to process these 
control plane messages in real time. Note that current 
switches/routers do offer the ability to process the control
plane messages with their embedded CPUs. However,
their CPUs mainly focus on management functions
and generally lack the ability to process packets
with high throughput and low latency.


New DCN service support by enabling in-network 
packet processing. Unlike the Internet which consists of 
many ISPs owned by different organizations, data centers 
are owned and administrated by a single operator. Hence 
we expect that technology innovations will be adopted 
faster in the data center environment. One such inno- 
vation is to introduce more intelligence into data cen- 
ter networks by enabling in-network traffic processing. 
For example, CamCube [4] proposed a cache service by 
introducing packet filtering, processing, and caching in 
the network. We can also introduce switch-assisted reli- 
able multicast [18, 8] in DCN, as discussed in [26]. For 
an in-network packet processing based DCN service, we 
need the programmability such as arbitrary packet mod- 
ification, processing and caching, which is much more 
than the programmability provided by the programmable 
packet forwarding engine in our first design goal. More 
importantly, we need low overhead, line-rate data pro- 
cessing, which may reach several to tens of Gb/s. 


The above design goals call for a platform which is 
programmable for both data and control planes, and it 
needs to achieve high throughput and low processing 
latency. Besides the programmability and high perfor- 
mance design goals, we have two additional require- 
ments (or constraints) from the real world. First, the 
programmability we provide should be easy to use. Second,
it is highly desirable that the platform is built from
(inexpensive) commodity components (e.g., merchandise
chips). We believe that a platform based on commodity
components has a pricing advantage over non-commodity,
expensive ones. The easy-to-program requirement
ensures the platform is easy to use, and the
commodity constraint ensures the platform is amenable
to wide adoption.

Our study revealed that none of the existing plat- 
forms meet all our design goals and the easy-to-program 
and commodity constraints. The pure software based 
approaches, e.g., Click, have full and easy-to-use pro- 
grammability, but cannot provide low latency packet pro- 
cessing and high packet forwarding rate. FPGA-based 
systems, e.g., NetFPGA, are not as easy to program as 
the commodity servers, and their prices are generally 
high. For example, the price of the Virtex-II Pro 50 used
in NetFPGA is $1,180 per chip for quantities of 100+ chips,
as listed on the Xilinx website. OpenFlow switches provide
certain programmability for both forwarding and control
functions. But due to the separation of switches and the
controller, it is unclear how OpenFlow can be extended to
support congestion control and in-network data processing.

We design ServerSwitch to meet the three design goals 
and the two practical constraints. ServerSwitch has a 
hardware part and a software part. The hardware part 
is a merchandise switching chip based NIC plus a com- 
modity server. The ServerSwitch software manages the 
hardware and provides APIs for developers to program 
and control ServerSwitch. In the next section, we will de- 
scribe the architecture of ServerSwitch, and how Server- 
Switch meets the design goals and constraints. 


3 Design 


3.1 ServerSwitch Architecture 


Our ServerSwitch architecture is influenced by progress 
and trends in ASIC switching chip and server tech- 
nologies. First, though commodity switches are black 
boxes to their users, the switching chips inside (e.g., 
from Broadcom, Fulcrum, and Marvell) are becoming in- 
creasingly programmable. They generally provide exact 
matching (EM) based on MAC addresses or MPLS tags, 
provide longest prefix matching (LPM) based on IP ad- 
dresses, and have a TCAM (ternary content-addressable 
memory) table. Using this TCAM table, they can pro- 
vide arbitrary field matching. Of course, the width of the 
arbitrary field is limited by the hardware, but is gener- 
ally large enough for our purpose. For example, Broad- 
com Enduro series chips have a maximum width of 32 
bytes, and Fulcrum FM3000 can match up to 78 bytes 
in the packet header [3]. Based on the matching re- 
sult, the matched packets can then be programmed to 
be forwarded, discarded, duplicated (e.g., for multicast 
purpose), or mirrored. Though the programmability is
limited, we will show later that it is already enough for
all packet forwarding functions in existing, and arguably,
many future DCN designs.

Figure 1: ServerSwitch architecture.

Second, commodity CPU (e.g., x86 and x64 CPUs)
based servers now have a high-speed, low latency inter- 
face, i.e., PCI-E, to connect to I/O subsystems such as 
a network interface card (NIC). Even PCI-E 1.0 X4 can 
provide 20Gbps bidirectional throughput and microsec- 
ond latency between the server CPU and NIC. Moreover, 
commodity servers are arguably the best programmable 
devices we currently have. It is very easy to write kernel 
drivers and user applications for packet processing with 
various development tools (e.g., C/C++). 

ServerSwitch then takes advantage of both commod- 
ity servers and merchandise switching chips to meet our 
design goals. Fig. 1 shows its architecture. The hard- 
ware part is an ASIC switching chip based NIC and a 
commodity server. The NIC and server are connected by 
PCI-E. From the figure, we can see there are two PCI-E 
channels. One is for the server to control and program 
the switching chip, the other is for data packet exchange 
between the server and switching chip. 

The software part has a kernel component and an application
component. The kernel component has
a switching chip (SC) driver to manage the commod- 
ity switching chip and an NIC driver for the NICs. 
The central part of the kernel component is a Server- 
Switch driver, which sends and receives control mes- 
sages and data packets through the SC and NIC drivers. 
The ServerSwitch driver is where control
messages, routing, congestion control, and in-network
packet processing are handled. The application component
is for developers. Developers use the provided APIs to
interface with the ServerSwitch driver, and to program
and control the switching chip.

Our ServerSwitch nicely fulfills all our design goals
and meets the easy-to-program and commodity constraints.
The switching chip provides a programmable
packet forwarding engine which can perform packet 
matching based on flexible packet fields, and achieve 
full line rate forwarding even for small packet sizes. 
The ServerSwitch driver together with the PCI-E inter- 
face achieves low latency communication between the 
switching chip and server CPU. Hence various rout- 
ing, signaling and flow/congestion controls can be well 
supported. Furthermore, the switch chip can be pro- 
grammed to select specific packets into the server CPU 
for advanced processing (such as in-network caching) 
with high throughput. The commodity constraint is di- 
rectly met since we use only commodity, inexpensive 
components in ServerSwitch. ServerSwitch is easy to 
use since all programming is performed using standard 
C/C++. When a developer introduces a new DCN de- 
sign, he or she needs only to write an application to pro- 
gram the switching chip, and add any needed functions 
in the ServerSwitch driver. 

The ability of our ServerSwitch is constrained by the 
abilities of the switching chip, the PCI-E interface, and 
the server system. For example, we may not be able to 
handle packet fields which are beyond the TCAM width, 
and we cannot further cut the latency between the switch- 
ing chip and server CPU. In practice, however, we are 
still able to meet our design goals with these constraints. 
In the rest of this section, we will introduce the pro- 
grammable packet forwarding engine, the software, and 
the APIs in detail. 


3.2 ASIC-based Programmable Packet Forwarding Engine


In this section, we discuss how existing Ethernet switch- 
ing chips can be programmed to support various packet 
forwarding schemes. 

There are three commonly used forwarding schemes 
in current DCN designs, i.e., Destination Address (DA) 
based, tag-based, and Source Routing (SR) based for- 
warding. DA-based forwarding is widely adopted by 
Ethernet and IP networks. Tag-based forwarding decou- 
ples routing from forwarding which makes traffic engi- 
neering easier. SR-based forwarding gives the source 
server ultimate control of the forwarding path and simplifies
the functions in forwarding devices. Table 1 summarizes
the forwarding primitives and existing DCN designs
for these three forwarding schemes. There are
three basic primitives to forward a packet, i.e., lookup
key extraction, key matching, and header modification.
Note that the matching criterion is independent of the forwarding
scheme, i.e., a forwarding scheme can use any
matching criterion. In practice, two commonly used criteria
are EM and LPM. Next, we describe the three forwarding
schemes in detail. We start from SR-based forwarding.

Table 1: Forwarding schemes and primitives.


3.2.1 Source Routing based Forwarding using 
TCAM 


For SR-based forwarding, there are two approaches de- 
pending on how the lookup key is extracted: indexed and 
non-indexed SR-based forwarding. In both approaches, 
the source fills a series of intermediate addresses (IA) in 
the packet header to define the packet forwarding path. 
For non-indexed Source Routing (nISR), the forwarding
engine always uses the first IA for table lookup
and pops it before sending the packet. For Indexed
Source Routing (ISR), there is an index i to denote the
current hop. The engine first reads the index, then extracts
IA_i based on the index, and finally updates the index
before sending the packet. We focus on ISR support
in the rest of this subsection. We will discuss nISR sup- 
port in the next subsection since it can be implemented 
as a form of tag-based forwarding. 
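
The two key-extraction approaches can be sketched in software as follows. This is only an illustration of the logic, not the chip's implementation: the header layout (`struct sr_header`), the one-byte field widths, the zero-based index, and the function names are all assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_HOPS 8

/* Hypothetical source-routing header: an index plus a list of
 * intermediate addresses (IAs). Field widths are illustrative only. */
struct sr_header {
    uint8_t index;          /* current hop, zero-based here (ISR only) */
    uint8_t n_ia;           /* number of valid IAs */
    uint8_t ia[MAX_HOPS];   /* intermediate addresses */
};

/* Indexed SR: read the index, extract IA[index], then advance the index.
 * Returns 0 on success, -1 if the index is out of range. */
int isr_extract_key(struct sr_header *h, uint8_t *key)
{
    if (h->index >= h->n_ia)
        return -1;
    *key = h->ia[h->index];
    h->index++;
    return 0;
}

/* Non-indexed SR: always use the first IA as the key and pop it.
 * Returns 0 on success, -1 if no IA is left. */
int nisr_extract_key(struct sr_header *h, uint8_t *key)
{
    if (h->n_ia == 0)
        return -1;
    *key = h->ia[0];
    memmove(&h->ia[0], &h->ia[1], (size_t)(h->n_ia - 1));
    h->n_ia--;
    return 0;
}
```

Note that ISR leaves the IA list untouched and only rewrites the index, whereas nISR shortens the header at every hop.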

ISR-based forwarding uses two steps for lookup key 
extraction. It first gets the index from a fixed location, 
and then extracts the key pointed to by the index. However,
commodity switching chips rarely have the logic to per- 
form this two-step indirect lookup key extraction. In this 
paper, we design a novel solution by leveraging TCAM 
and turning this two-step key extraction into a single step 
key extraction. The TCAM table has many entries and 
each entry has a value and a mask. The mask sets the
‘care’ and ‘do-not-care’ bits for the value.

In our design, for each incoming packet, the forwarding
engine compares its index field and all IA fields against
the TCAM table. The TCAM table is set up as follows.
For each TCAM entry, the index field (i) and the IA_i field
pointed to by the index are ‘care’ fields. All other IA fields
are ‘do-not-care’ fields. Thus, a TCAM entry can simultaneously
match both the index and the corresponding
IA_i field. As both index and IA_i may vary, we enumerate
all the possible combinations of index and IA values in
the TCAM table. When a packet comes in, it will match
one and only one TCAM entry. The action of that entry
determines the operation on that matched packet.

Figure 2: Support indexed source routing using TCAM.

Fig. 2 illustrates how the procedure works. The incoming
packet has one index field and three IA fields.
IA_2 is the lookup key for this hop. In the TCAM table,
the white fields are the ‘care’ fields and the gray fields
are the ‘do-not-care’ fields. Suppose there are two possible
IA addresses and the maximum value of the index
is three; then there are 6 entries in the TCAM table. This
incoming packet matches the 5th entry, where Index = 2
and IA_2 = 2. The chip then directs the packet to output
port 2. In § 5.1, we will describe the exact packet format
based on our ServerSwitch.

This design makes a trade-off between the requirement 
of extra ASIC logic and the TCAM space. When there 
are n different IA values, the two-step indirect match- 
ing method uses n lookup entries, while this one-step 
method uses n × d entries where d is the maximum value
of the index. d is always less than or equal to the network
diameter. Modern switching chips have at least thou- 
sands of TCAM entries, so this one-step method works 
well in the DCN environment. For example, consider 
a medium sized DCN such as a three-level fat-tree in 
Portland. When using 48-port switches, there are 27,648 
hosts. We can use 48 IA values to differentiate these 48 
next hop ports. Since the diameter of the network is 6, 
the number of TCAM entries is 48 × 6 = 288, which is
much smaller than the TCAM table size. 
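
As a rough software model of the one-step method, the sketch below enumerates the n × d (index, IA value) combinations as value/mask entries and performs a ternary lookup over the whole key. The key layout (byte 0 is the index, bytes 1..d hold IA_1..IA_d), the one-byte field widths, the `port_of_ia` map, and all names are assumptions for illustration; the real chip matches in hardware and its table layout may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_D   8              /* max index value supported by this sketch */
#define KEY_LEN (1 + MAX_D)    /* byte 0: index, bytes 1..d: IA_1..IA_d */

struct tcam_entry {
    uint8_t value[KEY_LEN];
    uint8_t mask[KEY_LEN];     /* 0xFF = 'care', 0x00 = 'do-not-care' */
    int     out_port;
};

/* Enumerate every (index, IA value) pair: n * d entries in total.
 * port_of_ia[v] is a hypothetical map from IA value v to an output port.
 * Returns the number of entries written, or 0 on invalid arguments. */
size_t isr_build_tcam(struct tcam_entry *tbl, size_t cap,
                      int n, int d, const int *port_of_ia)
{
    size_t k = 0;
    if (n <= 0 || n > 256 || d <= 0 || d > MAX_D)
        return 0;
    if ((size_t)n * (size_t)d > cap)
        return 0;
    for (int idx = 1; idx <= d; idx++) {
        for (int v = 0; v < n; v++) {
            struct tcam_entry *e = &tbl[k++];
            memset(e, 0, sizeof *e);
            e->value[0]   = (uint8_t)idx;  /* index field is a 'care' field */
            e->mask[0]    = 0xFF;
            e->value[idx] = (uint8_t)v;    /* only IA_idx is a 'care' field */
            e->mask[idx]  = 0xFF;
            e->out_port   = port_of_ia[v];
        }
    }
    return k;
}

/* One-step ternary match: return the port of the first entry whose
 * 'care' bits all agree with the key, or -1 if nothing matches. */
int tcam_lookup(const struct tcam_entry *tbl, size_t n_entries,
                const uint8_t key[KEY_LEN])
{
    for (size_t k = 0; k < n_entries; k++) {
        int hit = 1;
        for (int b = 0; b < KEY_LEN; b++) {
            if ((key[b] ^ tbl[k].value[b]) & tbl[k].mask[b]) {
                hit = 0;
                break;
            }
        }
        if (hit)
            return tbl[k].out_port;
    }
    return -1;
}
```

With n = 48 and d = 6, `isr_build_tcam` fills 48 × 6 = 288 entries, matching the fat-tree example above.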


3.2.2 Destination and Tag-based Forwarding 


As for the DA-based forwarding, the position of the 
lookup key is fixed in the packet header and the forward- 
ing engine reads the key directly from the packet header. 
No lookup key modification is needed since the destina- 
tion address is a globally unique id. However, the des- 
tination address can be placed anywhere in the packet 
header, so the engine must be able to perform matching 
on arbitrary fields. For example, PortLand requires the
switch to perform LPM on the destination MAC address,
whereas DCell uses a self-defined header.

Tag-based routing also uses direct key extraction, but 
the tag needs to be modified on a per-hop basis since 
the tags have only local meaning. To support this
routing scheme, the forwarding engine must support 
SWAP/POP/PUSH operations on tags. 

Modern merchandise switching chips generally have 
a programmable parser, which can be used to extract ar- 
bitrary fields. The TCAM matching module is flexible 
enough to implement EM, LPM [25], and range match- 
ing. Hence, DA-based forwarding can be well supported. 
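
To make the value/mask mechanism concrete, here is a minimal sketch of how a single ternary comparison expresses both EM and prefix matching on a 48-bit MAC address. The helper names and the use of a 64-bit integer to hold the address are assumptions for illustration. A full LPM table would additionally need entries ordered so that longer prefixes take priority, which this sketch does not model; range matching is also omitted.

```c
#include <stdint.h>

#define MAC_BITS 48

/* Build a 48-bit mask whose top prefix_len bits are 'care' bits.
 * prefix_len == 48 gives exact matching (EM). */
uint64_t prefix_mask48(int prefix_len)
{
    const uint64_t all = (UINT64_C(1) << MAC_BITS) - 1;
    if (prefix_len <= 0)
        return 0;
    if (prefix_len >= MAC_BITS)
        return all;
    return all & ~((UINT64_C(1) << (MAC_BITS - prefix_len)) - 1);
}

/* Ternary comparison: only bits set in mask take part in the match.
 * Returns 1 on a match and 0 otherwise. */
int ternary_match(uint64_t key, uint64_t value, uint64_t mask)
{
    return ((key ^ value) & mask) == 0;
}
```

For instance, `prefix_mask48(24)` yields the mask FFFFFF000000, so any address sharing the first three bytes of the entry's value matches.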

For tag-based forwarding, many commodity switching 
chips for Metro Ethernet Network already support MPLS 
(multiprotocol label switching), which is the representative
tag-based forwarding technology. Those chips
support POP/PUSH/SWAP operations on the MPLS la- 
bels in the packet header. Hence we can support tag- 
based forwarding by selecting a switching chip with 
MPLS support. Further, by using tag stacking and POP 
operations, we can also support nISR-based forwarding.
In such an nISR design, the source fills a stack of tags to denote
the routing path and the intermediate switches use
the outermost tag for table lookup and then pops the tag 
before forwarding the packet. 


3.3 Server Software

3.3.1 Kernel Components


The ServerSwitch driver is the central hub that receives 
all incoming traffic from the underlying ServerSwitch 
card. The driver can process them itself or it can de- 
liver them to the user space for further processing. Pro- 
cessing them in the driver gives higher performance but 
requires more effort to program and debug. Meanwhile, 
processing these packets in user space is easier to develop
but sacrifices performance. Instead of making a
choice on behalf of users, ServerSwitch allows users to 
decide which one to use. For low rate control plane traf- 
fic where processing performance is not a major concern, 
e.g., ARP packets, ServerSwitch can deliver them to user 
space for applications to process them. Since the ap- 
plications need to send control plane traffic too, Server- 
Switch provides APIs to receive packets from user-space 
applications to be sent down to the NIC chips. For those 
control plane packets with low latency requirement and 
high speed in-network processing traffic whose perfor- 
mance is a major concern, e.g., QCN queue queries or 
data cache traffic, we can process them in the Server- 
Switch driver. 

The SC and NIC drivers both act as the data channels 
between the switching chip and the ServerSwitch driver. 
They receive packets from the device and deliver them to 
the ServerSwitch driver, and vice versa. The SC driver
also provides an interface for the user library and the
ServerSwitch driver to manipulate its registers directly, so both
applications and the ServerSwitch driver can control the
switching chip directly.


3.3.2 APIs 


We design a set of APIs to control the switching chip and 
send/receive packets. The APIs include five categories as 
follows. 

1. Set User Defined Lookup Key (UDLK): This API 
configures the programmable parser in the switching 
chip by setting the i-th UDLK. In this API, the UDLK 
can be fields from the packet header as well as meta- 
data, e.g., the incoming port of a packet. We use the most 
generic form to define packet header fields, i.e., the byte 
position of the desired fields. In the following example, 
we set the destination MAC address (6 bytes, B0-5) as
the first UDLK. We can also combine meta-data (e.g.,
incoming port) and non-consecutive byte ranges to define
a UDLK, as shown in the second statement, which is used
for BCube (§ 5.1).


API:
SetUDLK(int i, UDLK udlk)
Example:
SetUDLK(1, (B0-5))
SetUDLK(2, (INPORT, B30-33, B42-45))


2. Set Lookup Table: There are several lookup tables 
in the switching chip, a general purpose TCAM table, 
and protocol specific lookup tables for Ethernet, IP, and 
MPLS. This API configures different lookup tables de- 
noted by type, and sets the value, mask and action for 
the i-th entry. The mask is NULL when the lookup ta- 
ble is an EM table. The action is a structure that defines 
the actions to be taken for the matched packets, e.g., di- 
recting the packets to a specified output port, performing 
pre-defined header modifications, etc. For example, for 
MPLS the modification actions can be Swap/Pop/Push. 
The iudlk is the index of UDLK to be compared. iudlk is 
ignored for the tables that do not support UDLK. 

In the following example, the statement sets the first 
TCAM entry and compares the destination MAC address 
(the first UDLK) with the value field (000001020001, 
i.e., 00:00:01:02:00:01) using mask (FFFFFF000000).
This statement is used to perform LPM on dest MAC for 
PortLand. Consequently, all matching packets are for- 
warded to the third virtual interface. 


API:
SetLookupTable(int type, int i,
    int iudlk, char *value, char *mask,
    ACTION *action)
Example:
SetLookupTable(TCAM, 1,
    1, "000001020001", "FFFFFF000000",
    {act=REDIRECT_VIF, vif=3})

3. Set Virtual Interface Table: This API sets up the i-th
virtual interface entry which contains destination and
source MAC addresses as well as the output port. The
MAC addresses are used to replace the original MACs in
the packet when they are not NULL.

For example, the following command sets up the third 
virtual interface to deliver packets to output port 2. 
Meanwhile, the destination MAC is changed to the given 
value (001F29D417E8) accordingly. The edge switches 
in Portland need such functionality to change PMAC 
back to the original MAC (§3.2 in [22]).


API:
SetVIfTable(int i, char *dmac,
    char *smac, int oport)
Example:
SetVIfTable(3, "001F29D417E8", NULL, 2)
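
The effect of a virtual interface entry on a matched packet can be modelled in software roughly as follows. This is a sketch of the semantics described above, not the chip's implementation: `struct vif_entry`, `vif_apply`, and the assumption of a standard Ethernet layout (bytes 0-5 destination MAC, bytes 6-11 source MAC) are introduced here for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical in-memory form of one virtual interface entry. */
struct vif_entry {
    const uint8_t *dmac;   /* new destination MAC, or NULL to keep it */
    const uint8_t *smac;   /* new source MAC, or NULL to keep it */
    int oport;             /* output port */
};

/* Rewrite the Ethernet MACs of a frame in place and return the output
 * port. Assumes bytes 0-5 hold the DMAC and bytes 6-11 the SMAC.
 * Returns -1 if the frame is too short to hold both addresses. */
int vif_apply(const struct vif_entry *vif, uint8_t *frame, size_t len)
{
    if (len < 2 * MAC_LEN)
        return -1;
    if (vif->dmac != NULL)
        memcpy(frame, vif->dmac, MAC_LEN);
    if (vif->smac != NULL)
        memcpy(frame + MAC_LEN, vif->smac, MAC_LEN);
    return vif->oport;
}
```

An entry with a non-NULL `dmac`, a NULL `smac`, and `oport` 2 corresponds to the example above: the destination MAC is replaced, the source MAC is kept, and the packet leaves on port 2.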


4. Read/Write Registers: There are many statistics registers
in the switching chip, e.g., queue length and packet
counters, and registers to configure the behaviors of the
switching chip, e.g., enable/disable L3 processing. This
API reads and writes those registers (specified by regname).
As an example, the following command returns
the queue length (in bytes) of output port 0. 


API:
int ReadRegister(int regname)
int WriteRegister(int regname, int value)
Example:
ReadRegister(OUTPUT_QUEUE_BYTES_PORT0)


5. Send/Receive Packet: There are multiple NICs for 
sending and receiving packets. We can use the first API 
to send a packet to a specific NIC port (oport). When we
receive a packet, the second API also provides the input 
NIC port (iport) for the packet. 


API:
int SendPacket(char *pkt, int oport)
int RecvPacket(char *pkt, int *iport)


4 Implementation 


4.1 ServerSwitch Card 


Fig. 3 shows the ServerSwitch card we designed. All 
chips used on the card are merchandise ASICs. The 
Broadcom switching chip BCM56338 has 8 Gigabit Ethernet
(GE) ports and two 10GE ports [1]. Four of the
GE ports connect externally and the other four GE ports
connect to two dual GE port Intel 82576EB NIC chips.
The two NIC chips are used to carry a maximum of
4Gb/s traffic between the switching chip and the server
since the bandwidth of the PCI-E interface on the BCM56338 is
only 2Gb/s. The three chips connect to the server via
a PCI-E switch PLX PEX8617. The effective bandwidths
from the PEX8617 to the BCM56338, the two NIC
chips and the server are 2, 8, 8 and 8Gb/s (single direction),
respectively. Since the maximum inbound or outbound traffic is
4Gb/s, PCI-E is not the bottleneck. The two 10GE XAUI
ports are designed for interconnecting multiple ServerSwitch
cards in one server chassis to create a larger non-blocking
switching fabric with more ports. Each ServerSwitch
card costs less than $400 when manufactured in
quantities of 100. We expect the price can be cut to $200 for
a quantity of 10K. The power consumption of ServerSwitch
is 15.4W when all 8 GE ports are idle, and is
15.7W when all of them carry full speed traffic.

Figure 3: ServerSwitch card.

Fig. 4 shows the packet processing pipeline of the 
switching chip, which has three stages. First, when the 
packets go into the switching chip, they are directed to 
a programmable parser and a classifier. The classifier 
then directs the packets to one of the protocol specific 
header parsers. The Ethernet parser extracts the desti- 
nation MAC address (DMAC), the IP parser extracts the 
destination IP address (DIP), the MPLS parser extracts 
the MPLS label and the Prog parser can generate two 
different UDLKs. Each UDLK can contain any four aligned
4-byte blocks from the first 128 bytes of the packet,
and some meta-data of the packet. 

Next, the DMAC is sent to the EM(MAC) matching 
module, the DIP to both the LPM and EM(IP) matching
modules, the MPLS label to the EM(MPLS) module, and 
the UDLK to the TCAM. Each TCAM entry can select 
one of the two UDLKs to match. The matchings are per- 
formed in parallel. The three matching modules (EM, 
LPM, TCAM) result in an index into the interface table, 
which contains the output port, destination and source 
MAC. When multiple lookup modules match, the prior- 
ity of their results follows TCAM > EM > LPM. 
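
The priority rule among the parallel lookup modules can be written as a small selection function. The sketch below is illustrative only: `struct match_result` and `resolve_interface` are hypothetical names, and the hardware resolves this in its pipeline rather than in software.

```c
/* Result of one lookup module: a hit flag plus the interface-table
 * index it produced. Hypothetical type for this sketch. */
struct match_result {
    int hit;        /* non-zero if the module matched */
    int if_index;   /* index into the interface table (valid when hit) */
};

/* Pick the final interface index following TCAM > EM > LPM.
 * Returns -1 when no module matched. */
int resolve_interface(struct match_result tcam,
                      struct match_result em,
                      struct match_result lpm)
{
    if (tcam.hit)
        return tcam.if_index;
    if (em.hit)
        return em.if_index;
    if (lpm.hit)
        return lpm.if_index;
    return -1;
}
```

One consequence of this ordering is that a TCAM entry can override a protocol-specific EM or LPM result for the same packet.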

Finally, the packet header is modified by the L3 and L2
modifiers accordingly. The L3 modifier changes the L3
header, e.g., IP TTL, IP checksum and MPLS label. The
L2 modifier can use the MAC addresses in the interface
table to replace the original MAC addresses.

Figure 4: Packet processing pipeline in Broadcom
56338.

The sizes of the EM tables for MAC, IPv4 and MPLS are
32K, 8K and 4K entries, respectively. The LPM for IPv4 
and the TCAM table have 6144 and 2K entries, respec- 
tively. The interface table has 4K entries. All these ta- 
bles, the Prog Parser and the behaviors of the modifiers 
are programmable. 


4.2 Kernel Drivers 


We have developed ServerSwitch kernel drivers for Windows
Server 2008 R2. As shown in Fig. 1, they comprise the
following components.

Switching Chip Driver. We implemented a PCI-E 
driver based on Broadcom’s Dev Kits. The driver has 
2670 lines of C code. It allocates a DMA region and 
maps the chip’s registers into memory address using 
memory-mapped I/O (MMIO). The driver can deliver re- 
ceived packets to the ServerSwitch driver, and send pack- 
ets to hardware. The ServerSwitch driver and user library 
can access the registers and thus control the switching 
chip via this SC driver. 

NIC Driver. We directly use the most recent Intel NIC 
driver binaries. 

ServerSwitch Driver. We implemented the Server- 
Switch driver as a Windows NDIS MUX driver. It has 
20719 lines of C code. The driver exports itself as a vir- 
tual NIC. It binds the TCP/IP stack on its top and the In- 
tel NIC driver and the SC driver at its bottom. The driver 
uses IRP to send and receive packets from the user li- 
brary. It can also deliver the packets to the TCP/IP stack. 
The ServerSwitch driver provides a kernel framework for 
developing various DCN designs. 


4.3 User Library 


The library is based on the Broadcom SDK. The SDK 
has 3000K+ lines of C code and runs only on Linux and 
VxWorks. We ported this SDK to Subsystem for UNIX-based 
Applications (SUA) on Windows Server 2008 [2]. 
At the bottom of the SDK, we added a library to interact 
with our kernel driver. We then developed ServerSwitch 
APIs over the SDK. 

Figure 5: BCube header on the ServerSwitch platform. 


5 Building with the ServerSwitch Platform 


In this section, we use ServerSwitch to implement sev- 
eral representative DCN designs. First, we implement 
BCube to illustrate how indexed source routing is sup- 
ported in the switching chip of ServerSwitch. In our 
BCube implementation, BCube packet forwarding is 
purely carried out in hardware. Second, we show our 
implementation of QCN congestion control. Our QCN 
implementation demonstrates that our ServerSwitch can 
generate low latency control messages using the server 
CPU. Due to space limitation, we discuss how Server- 
Switch can support other DCN designs in our technical 
report [19]. 


5.1 BCube 


BCube is a server centric DCN architecture [13]. BCube 
uses adaptive source routing. Source servers probe mul- 
tiple paths and select the one with the highest available 
bandwidth. BCube defines two types of control mes- 
sages, for neighbor discovery (ND) and available band- 
width query (ABQ) respectively. The first one is for 
servers to maintain the forwarding table. The second one 
is used to probe the available bandwidth of the multiple 
parallel paths between the source and destination. 

Our ServerSwitch is an ideal platform for implement- 
ing BCube. For an intermediate server in BCube, our 
ServerSwitch card can offload packet forwarding from 
the server CPU. For source and destination servers, our 
ServerSwitch card can achieve k:1 speedup using k NICs 
connected by BCube topology. This is because in our 
design the internal bandwidth between the server and the 
NICs is equal to the external bandwidth provided by the 
multiple NICs, as we show in Figure 1. 

Fig. 5 shows the BCube header we use. It consists of 
an IP header and a private header (gray fields). We use 
this private header to implement the BCube header. We 
use an officially unassigned IP protocol number to 
differentiate the packet from normal TCP/UDP packets. In 
the private header, the BCube protocol field is used to 
identify control plane messages. NH is the number of valid 
NHA fields. It is used by a receiver to construct a reverse 
path to the sender. There are eight 1-byte Next Hop Address 
(NHA) fields, defined in BCube for indexed source routing. 
Unlike the original BCube header design, our private 
header fills the NHAs in reverse order, so NHA_1 is now 
the lookup key for the last hop. This adaptation lets the 
hardware provide an automatic index counter. We observe 
that for a normal IP packet, the TTL is automatically 
decreased by one at each hop. Therefore, we overload the 
TTL field in the IP header as the index field for the NHAs, 
which is why we store the NHAs in reverse order. 
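
The following Python sketch illustrates how a decreasing TTL can walk through NHAs stored in reverse order. It is a simplified model, not the ServerSwitch implementation: the TTL offset of 2 is taken from the Fig. 6 pseudo code (TTL = index + 2), while the initial TTL chosen by the source, the dictionary layout, and the names `build_header` and `forward_one_hop` are assumptions made for illustration.

```python
# Hypothetical model of TTL-indexed source routing.
TTL_OFFSET = 2   # TTL value that selects NHA slot 0 (the last hop)

def build_header(path_nhas):
    """Store the hop-by-hop NHAs in reverse order and pick the initial TTL."""
    if not 1 <= len(path_nhas) <= 8:
        raise ValueError("the header carries between 1 and 8 NHAs")
    slots = list(reversed(path_nhas))        # slot 0 holds the last hop
    slots += [0] * (8 - len(slots))          # pad the unused slots
    ttl = len(path_nhas) - 1 + TTL_OFFSET    # assumed: selects the first hop
    return {"ttl": ttl, "nh": len(path_nhas), "nha": slots}

def forward_one_hop(header):
    """Return the NHA used at this hop, then decrement TTL as IP forwarding does."""
    nha = header["nha"][header["ttl"] - TTL_OFFSET]
    header["ttl"] -= 1
    return nha

hdr = build_header([0x11, 0x22, 0x33])       # hops in travel order
print([hex(forward_one_hop(hdr)) for _ in range(3)])
```

Because the TTL only ever decreases, storing the last hop in slot 0 lets the unmodified TTL decrement act as the index counter.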


We implemented a BCube kernel module in the 
ServerSwitch driver and a BCube agent at the user-level. 
The kernel module implements data plane functionali- 
ties. On the receiving direction, it delivers all received 
control messages to the user-level agent for processing. 
For any received data packets, it removes their BCube 
headers and delivers them to the TCP/IP stack. On the 
sending direction, it adds the BCube header for the pack- 
ets from the TCP/IP stack and sends them to the NICs. 

The BCube agent implements all control plane func- 
tionalities. It first sets up the ISR-based forwarding rules 
and the packet filter rules in the switching chip. Then, 
it processes the control messages. When it receives 
an ND message, it updates the interface table using 
SetVIfTable. It periodically uses ReadRegister 
to obtain traffic volume from the switching chip and cal- 
culates the available bandwidth for each port. When it 
receives an ABQ message, it encodes the available band- 
width in the ABQ message, and sends it to the next hop. 
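
The available bandwidth calculation can be sketched as follows. This is an illustrative model only: it assumes the traffic volume is exposed as a wrapping byte counter (the 32-bit width and the 1Gb/s port capacity are assumed defaults), and the function name `available_bandwidth_bps` is hypothetical. The counter values would come from periodic register reads, which are not modeled here.

```python
def available_bandwidth_bps(prev_bytes, curr_bytes, interval_s,
                            capacity_bps=1e9, counter_bits=32):
    """Estimate a port's spare capacity from two byte-counter samples."""
    # The modulo tolerates a single counter wraparound between samples.
    delta = (curr_bytes - prev_bytes) % (1 << counter_bits)
    used_bps = delta * 8 / interval_s
    return max(0.0, capacity_bps - used_bps)

# 62.5 MB sent during one second on a 1Gb/s port leaves half the capacity.
print(available_bandwidth_bps(0, 62_500_000, 1.0))
```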

Fig. 6 shows the procedure to initialize the switching 
chip for BCube, using the ServerSwitch API. Line 
1 sets a 12-byte UDLK_1 for source routing, including 
TTL (B22) and the NHA fields (B34-41). Line 2 sets 
another 9-byte UDLK_2 for packet filtering, including the 
incoming port number (INPORT), IP destination address 
(B30-33) and BCube protocol (B42). The INPORT 
occupies a 1-byte field. Lines 5-18 set the ISR-based TCAM 
table. Since every NHA denotes a neighbor node with a 
destination MAC and corresponding output port, line 8 
sets up one interface entry for one NHA value. Lines 13-16 
set up a TCAM entry to match the TTL and its 
corresponding NHA in UDLK_1. Since the switch discards 
IP packets whose TTL ≤ 1, we use TTL = 2 to denote 
NHA_1. Lines 21-38 filter packets to the server. Since the 
switching chip has four external (0-3) and four internal 


 1: SetUDLF(1, (B22-25, B34-41));
 2: SetUDLF(2, (INPORT, B30-33, B42-45));
 3:
 4: // setup ISR-based forwarding table
 5: j = 0;
 6: foreach nha in (all possible NHA values)
 7: {
 8:     SetVIfTable(nha, dstmac, srcmac, oport);
 9:     for (index = 0; index < 8; index++)
10:     {
11:         // val[0] matches B22 (TTL) in UDLF_1
12:         // val[4:11] matches B34-41 (NHAs) in UDLF_1
13:         val[0] = index+2; mask[0] = 0xff;
14:         val[4+index] = nha; mask[4+index] = 0xff;
15:         action.act = REDIRECT_VIF; action.vif = nha;
16:         SetLookupTable(TCAM, j++, 1, val, mask, &action);
17:     }
18: }
19:
20: // setup filter to server
21: for (i = 0; i < 4; i++)
22: {
23:     action.act = REDIRECT_PORT; action.port = 4 + i;
24:     // filter packets that are sent to localhost
25:     // val[0] matches INPORT in UDLF_2
26:     // val[1:4] match B30-33 (IP dst addr) in UDLF_2
27:     val[0] = i; mask[0] = 0xff;
28:     val[1:4] = my_bcube_id; mask[1:4] = 0xffffffff;
29:     SetLookupTable(TCAM, j++, 2, val, mask, &action);
30:     // filter control plane packets
31:     // val[5] matches B42 (BCube prot) in UDLF_2
32:     val[0] = i; mask[0] = 0xff;
33:     val[5] = ND; mask[5] = 0xff;
34:     SetLookupTable(TCAM, j++, 2, val, mask, &action);
35:     val[0] = i; mask[0] = 0xff;
36:     val[5] = ABQ; mask[5] = 0xff;
37:     SetLookupTable(TCAM, j++, 2, val, mask, &action);
38: }


Figure 6: Pseudo TCAM setup code for BCube. 


ports (4-7), we filter the traffic of an external port to a 
corresponding internal port, i.e., port 0→4, 1→5, 2→6 
and 3→7. Line 23 sets the action to direct the packets to 
ports 4-7 respectively. Lines 27-29 match those packets 
whose destination BCube address equals the local BCube 
address in UDLK_2. Lines 32-37 match BCube control 
plane messages, i.e., ND and ABQ, in UDLK_2. In our 
switching chip, when a packet matches multiple TCAM 
entries, the entry with the highest index wins. Therefore, 
in our BCube implementation, entries for control 
plane messages have higher priority than the other ones. 
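
The masked match and the highest-index-wins rule can be modeled with a short Python sketch. This is an illustrative software model, not the chip's behavior in detail: the function `tcam_lookup`, the protocol code `ND = 1`, the action strings, and the sample key bytes are all assumptions. Each entry names the UDLK it matches against, mirroring the third argument of SetLookupTable in Fig. 6.

```python
def tcam_lookup(entries, keys):
    """Return the action of the highest-index matching entry, or None.

    entries: list of (udlk_id, val, mask, action) in ascending index order.
    keys:    dict mapping udlk_id to the key bytes extracted from the packet.
    """
    winner = None
    for udlk_id, val, mask, action in entries:
        key = keys[udlk_id]
        if all((k & m) == (v & m) for k, v, m in zip(key, val, mask)):
            winner = action            # a later (higher) index overrides
    return winner

ND = 1  # hypothetical BCube protocol code for neighbor discovery

# Entry 0: ISR rule on the 12-byte key (TTL = 2, first NHA byte = 0x05).
isr_val = [2, 0, 0, 0, 0x05] + [0] * 7
isr_mask = [0xFF, 0, 0, 0, 0xFF] + [0] * 7
# Entry 1: control-plane filter on the 9-byte key (INPORT = 0, protocol = ND).
nd_val = [0, 0, 0, 0, 0, ND, 0, 0, 0]
nd_mask = [0xFF, 0, 0, 0, 0, 0xFF, 0, 0, 0]
entries = [(1, isr_val, isr_mask, "REDIRECT_VIF 5"),
           (2, nd_val, nd_mask, "REDIRECT_PORT 4")]

key1 = [2, 0xFD, 0xAB, 0xCD, 0x05] + [0] * 7     # bytes under mask 0 are ignored
data_pkt = {1: key1, 2: [0, 10, 0, 0, 9, 0, 0, 0, 0]}
ctrl_pkt = {1: key1, 2: [0, 10, 0, 0, 9, ND, 0, 0, 0]}
print(tcam_lookup(entries, data_pkt))   # only the ISR rule matches
print(tcam_lookup(entries, ctrl_pkt))   # both match; the higher index wins
```

Installing the filter entries after the forwarding entries, as Fig. 6 does with the running index j, is what gives the control plane rules precedence.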


5.2 Quantized Congestion Notification (QCN) 


QCN is a rate-based congestion control algorithm for the 
Ethernet environment [7]. The algorithm has two parts. 
The Switch or Congestion Point (CP) adaptively samples 
incoming packets and generates feedback messages ad- 
dressed to the source of the sampled packets. The feed- 
back message contains congestion information at the CP. 
The Source or Reaction Point (RP) then reacts based on 
the feedback from the CP. See [7] for QCN details. The 
previous studies of QCN are based on simulation or hard- 
ware implementation. 

Figure 7: QCN on the ServerSwitch platform. 


We implemented QCN on the ServerSwitch platform 
as shown in Fig. 7. The switching chip we use cannot 
adaptively sample packets based on the queue length, so 
we let the source mark packets adaptively and let the 
ServerSwitch switching chip mirror the marked packets 
to the ServerSwitch CPU. When the ServerSwitch CPU 
receives the marked packets, it immediately reads the 
queue length from the switching chip and sends the Con- 
gestion Notification (CN) back to the source. When the 
source receives the CN message, it adjusts its sending 
rate and marking probability. 


We implemented the CP and RP algorithms in Server- 
Switch and end-host respectively based on the most re- 
cent QCN Pseudo code V2.3 [24]. In order to minimize 
the response delay, the CP module is implemented in the 
ServerSwitch driver. The CP module sets up a TCAM 
entry to filter marked packets to the CPU. On the end- 
host, we implemented a token bucket rate limiter in the 
kernel to control the traffic sending rate at the source. 
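
A token bucket rate limiter of the kind mentioned above can be sketched in Python as follows. This is a generic user-space illustration under stated assumptions, not the kernel implementation: the class name, the explicit `now` timestamps, and the burst size are all hypothetical, and the rate passed to `set_rate` stands in for whatever the RP algorithm computes after a CN message.

```python
class TokenBucket:
    """Token bucket that meters bytes; timestamps are passed in explicitly."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bps = rate_bps
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.last = 0.0                    # time of the last refill, seconds

    def _refill(self, now):
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bps / 8.0)

    def set_rate(self, now, rate_bps):
        """Apply a new sending rate, e.g. after a CN message is processed."""
        self._refill(now)                  # settle tokens at the old rate first
        self.rate_bps = rate_bps

    def try_send(self, now, packet_bytes):
        """Return True and consume tokens if the packet may be sent now."""
        self._refill(now)
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False


bucket = TokenBucket(rate_bps=200e6, burst_bytes=3000)
print([bucket.try_send(0.0, 1500) for _ in range(3)])   # third packet must wait
print(bucket.try_send(100e-6, 1500))                    # refilled after 100 us
```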


6 Evaluation 


Our evaluation has two parts. In the first part, we show 
micro benchmarks for our ServerSwitch. We evaluate its 
performance on packet forwarding, register read/write, 
and in-network caching. For micro benchmark evalua- 
tion, we connect our ServerSwitch to a NetFPGA card 
and use NetFPGA to generate traffic. In the second part, 
we implement two DCN designs, namely BCube and 
QCN, using ServerSwitch. We build a 16-server BCube_1 
network to run BCube and QCN experiments. We have 
so far built only two ServerSwitch cards. As shown in 
Fig. 8, the two gray nodes are equipped with ServerSwitch 
cards; they use an ASUS motherboard with an Intel 
quad-core i7 2.8GHz CPU. The other 14 servers are Dell 
Optiplex 755 machines with a 2.4GHz dual-core CPU. 
The switches are 8-port DLink DGS-1008D GE switches. 


6.1 Micro Benchmarks 


We directly connect the four GE ports of one Server- 
Switch to the four GE ports of one NetFPGA, and use 
the NetFPGA-based packet generator to generate line- 
rate traffic to evaluate the packet forwarding performance 
of ServerSwitch. We record the packet send and re- 
ceive time using NetFPGA to measure the forwarding 
latency of ServerSwitch. The precision of the timestamp 
recorded by NetFPGA is 8ns. 


Forwarding Performance. Fig. 9 compares the forwarding 
performance of our ServerSwitch card and a 
software-based BCube implementation using an ASUS 
quad core server. In the evaluation, we use NetFPGA to 
generate 4GE traffic. The software implementation of 
BCube packet forwarding is very simple: it uses the NHA as 
an index to get the output port (see §7.2 in [13] for more 
details). As we can see, there is a huge performance 
gap between these two approaches. For ServerSwitch, 
there is no packet drop for any packet size, and the 
forwarding delay is small. The delays for 64-byte and 
1514-byte packets are 4.3µs and 15.6µs respectively, and 
the delay grows linearly with the packet size. The slope is 
7.7ns per byte, which is very close to the transmission 
delay of one byte over a GE link. The curve suggests the 
forwarding delay is a 4.2µs fixed processing delay plus the 
transmission delay. For software forwarding, the maximum 
PPS achieved is 1.73Mpps and packets get dropped when the 
packet size is less than or equal to 512 bytes. The CPU 
utilization for 1514-byte packets is already 65.6%. 
Moreover, the forwarding delay is also much larger than that 
of ServerSwitch. This experiment suggests that a switching 
chip does a much better job at packet forwarding, and that 
using software for ‘simple’ packet forwarding is not 
efficient. 
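
The slope quoted above can be cross-checked against the per-byte transmission time of a GE link with a few lines of Python. This is only a back-of-the-envelope check using the two endpoint delays; the endpoints alone give a value slightly different from the quoted slope, presumably because the quoted figure was derived from the full curve. The variable names are illustrative.

```python
GE_BITS_PER_SEC = 1e9
delays_us = {64: 4.3, 1514: 15.6}   # forwarding delays quoted in the text

# Slope between the two measured endpoints, in nanoseconds per byte.
slope_ns_per_byte = (delays_us[1514] - delays_us[64]) * 1000 / (1514 - 64)

# Time to serialize one byte (8 bits) on a 1Gb/s link, in nanoseconds.
wire_ns_per_byte = 8 * 1e9 / GE_BITS_PER_SEC

print(f"endpoint slope: {slope_ns_per_byte:.2f} ns/byte")
print(f"GE wire time:   {wire_ns_per_byte:.2f} ns/byte")
```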


Register Read/Write Performance. Certain applications 
need to read and write registers of the switching 
chip frequently. For example, our software-based QCN 
needs to frequently read the queue length from the 
switching chip. In this test, we continuously read/write a 
32-bit register 1,000,000 times, and the average latency of 
one read/write operation is 6.94/4.61µs. We note 
that the latency is larger than what has been reported 
before (around 1µs) [20]. This is because [20] measured 
the latency of a single MMIO R/W operation, whereas 
our registers are not mapped but are accessed indirectly 
via several mapped registers. In our case, a read operation 
consists of four MMIO write and three MMIO read 
operations. We note that the transmission delay of one 
1514-byte packet over a 1GE link is 12µs, so the read 
operation of our ServerSwitch can be finished within the 
transmission time of one packet. 

Figure 9: Packet forwarding performance. 

In-network Caching Performance. We show that 
ServerSwitch can be used to support in-network caching. 
In this experiment, ServerSwitch uses two GbE ports 
to connect to NetFPGA A and the other two ports to 
NetFPGA B. NetFPGA A sends request packets to B via 
ServerSwitch. When B receives one request, it replies 
with one data packet. The sizes of request and reply are 
128 and 1514 bytes, respectively. Every request or reply 
packet carries a unique ID in its packet header. When 
ServerSwitch receives a request from A, the switching 
chip performs an exact matching on the ID of the request. 
A match indicates that the ServerSwitch has already 
cached the response packet. The request is then for- 
warded to the server CPU which sends back the cached 
copy to A. When there is no match, the request is for- 
warded to B, and B sends back the response data. Server- 
Switch also oversees the response data and tries to cache 
a local copy. The request rate per link is 85.8Mb/s, so 
the response rate per link between ServerSwitch and A is 
966Mb/s. Since one NetFPGA has 4 ports, we use one 
NetFPGA to act as both A and B in the experiment. 

We vary the cache hit ratio at ServerSwitch and measure 
the CPU overhead of the ServerSwitch. In-network 
caching increases CPU usage at ServerSwitch, but saves 
bandwidth between B and ServerSwitch. In our toy 
network setup, an x% cache hit rate directly results in 
x% bandwidth saving between B and ServerSwitch (as 
shown in Fig. 10). In a real network environment, we 
expect the savings will be more significant since we can 
save more bandwidth for multi-hop cases. 

Figure 10: CPU utilization for in-network caching. 

Fig. 10 also shows the CPU overhead of the Server- 
Switch for different cache hit ratios. Of course, the 
higher the cache hit ratio, the more bandwidth we can 
save and the more CPU usage we need to pay. Note that 
in Fig. 10, even when the cache hit ratio is 0, we still have 
a cost of 14% CPU usage. This is because ServerSwitch 
needs to do caching for the 1.9Gbps response traffic from 
B to ServerSwitch. Fig. 10 also includes the CPU over- 
head of a pure software-based caching implementation. 
Our result clearly shows that our ServerSwitch signifi- 
cantly outperforms pure software-based caching. 


6.2 ServerSwitch based BCube 


In this experiment, we set up two TCP connections C1 
and C2 between servers 01 and 10. The two connections 
use two parallel paths, P1 {01, 00, 10} for C1 and P2 
{01, 11, 10} for C2, respectively. We run this experiment 
twice. First, we configure 00 and 11 to use the 
ServerSwitch cards for packet forwarding. Next, we configure 
them to use software forwarding. In both cases, the total 
throughput is 1.7Gbps and is split equally between the two 
parallel paths. When using ServerSwitch for forwarding, 
both 00 and 11 use zero CPU cycles. When using 
software forwarding, both servers use 15% of their CPU 
cycles. Since both servers have a quad core CPU, 15% CPU 
usage equals 60% of one core. 


6.3 ServerSwitch based QCN 


In this experiment, we configure server 00 to act as a 
QCN-enabled node. We use iperf to send UDP traffic 
from server 01 to 10 via 00. The sending rate of iperf 
is limited by the traffic shaper at 01. When there is con- 
gestion on level-1 port of 00, 00 sends CN to 01. We use 
the QCN baseline parameters [7] in this experiment. 

Fig. 11 shows the throughput of the UDP traffic and 
the output queue length at server 00. When we start the 
UDP traffic, the level-1 port runs at 1Gb/s. There is no 
congestion and the output queue length is zero. At time 
20s, we limit the level-1 port at 00 to 200Mb/s; the queue 
immediately builds up and causes 00 to send CN to the 
source. The source starts to use the QCN algorithm to adjust 
its traffic rate in order to maintain the queue length around 
Q_EQ, which is 50KB in this experiment. We can see that 
the sending rate decreases to 200Mb/s very quickly. We 
then increase the bandwidth by 200Mb/s every 20 
seconds. Similarly, the source adapts quickly to the new 
bandwidth. As shown in the figure, the queue length 
fluctuates around Q_EQ. This shows that this software-based 
implementation performs good congestion control. The 
rate of queue query packets processed by node 00 is very 
low during the experiment, with maximum and mean 
values of 801 and 173 pps. Hence QCN message processing 
introduces very little additional CPU overhead. The 
total CPU utilization is smaller than 5%. Besides, there is 
no packet drop in the experiment, even at the point when 
we decrease the bandwidth to 200Mb/s. QCN therefore 
achieves lossless packet delivery. We have varied 
Q_EQ from 25KB to 200KB and the results are similar. 

The extra delay introduced by our software approach 
to generate a QCN queue reply message consists of three 
parts: directing the QCN queue query to the CPU, reading 
the queue register, and sending back the QCN queue 
reply. To measure this delay, we first measure the time 
RTT_1 between the QCN query and reply at 01. Then 
we configure the switching chip to simply bounce the 
QCN query back to the source, emulating a hardware 
implementation with zero response delay. We measure the 
time RTT_2 between sending and receiving a QCN query 
at 01. RTT_1 - RTT_2 reflects the extra delay introduced 
by software. The packet sizes of the queue query and 
reply are both 64 bytes in this measurement. The average 
values of RTT_1 and RTT_2 are 41µs and 18µs based 
on 10,000 measurements. Our software introduces only 
23µs of delay. This extra delay is tolerable since it is 
comparable to or smaller than the packet transmission delay 
of a single 1500-byte packet in a multi-hop environment. 
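
The comparison can be made concrete with a small Python calculation. It uses only the RTT values reported above and the serialization time of a 1500-byte packet on a 1Gb/s link; the hop counts are illustrative, and the helper name `tx_delay_us` is hypothetical.

```python
def tx_delay_us(size_bytes, link_bps=1e9):
    """Serialization delay of one packet on a link, in microseconds."""
    return size_bytes * 8 * 1e6 / link_bps

rtt1_us, rtt2_us = 41.0, 18.0          # average values reported in the text
extra_us = rtt1_us - rtt2_us           # delay added by the software path

print(f"software overhead: {extra_us:.0f} us")
for hops in (1, 2, 3):
    print(f"{hops} hop(s): {tx_delay_us(1500) * hops:.0f} us to transmit 1500 bytes")
```

With two store-and-forward hops, the serialization delay of a single 1500-byte packet already exceeds the measured software overhead.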


7 Discussion 


Limitations of ServerSwitch. The current version of 
ServerSwitch has the following limitations: 1) Limited 
hardware forwarding programmability. The switching 
chip we use has limited programmability on header field 
modification. It supports only standard header 
modifications of supported protocols (e.g., changing Ethernet 
MAC addresses, decreasing IP TTL, changing IP DSCP, 
adding/removing IP tunnel header, modifying MPLS 
header). Due to the hardware limitation, our 
implementation of index-based source routing has to re-interpret 
the IP TTL field. 2) Relatively high packet processing 
latency due to switching chip to CPU communication. For 
the packets that require ‘real’ per-packet processing such 
as congestion information calculation in the XCP protocol, 
the switching chip must deliver them to the CPU for 
processing, which leads to higher latency. Hence 
ServerSwitch is not suitable for protocols that need real time 
per-packet processing such as XCP. 3) Restricted form 
factor and relatively low speed. At present, a 
ServerSwitch card provides only 4 GbE ports. Though it can be 
directly used for server-centric or hybrid designs, e.g., 
BCube, DCell, and CamCube, we do not expect that 
the current ServerSwitch can be directly used for 
architectures that need a large number of switch ports (48 
ports or more), e.g., fat-tree and VL2. However, since 
4 ServerSwitch cards can be connected together to 
provide 16 ports, we believe ServerSwitch is still a viable 
platform for system prototyping for such architectures. 

Figure 11: Throughput and queue dynamics during 
bandwidth change. 


10GE ServerSwitch. Using the same hardware 
architecture, we can build a 10GE ServerSwitch. We need to 
upgrade the Ethernet switching chip, the PCI-E switching 
chip and the NIC chips. As for the Ethernet switching 
chip, 10GbE switching chips with 24x10GbE ports or 
more are already available from Broadcom, Fulcrum or 
Marvell. We can use two dual 10GbE Ethernet controller 
chips to provide a 40Gb/s data channel between the card 
and server CPU. Since we do not expect all traffic to be 
delivered to the CPU for processing, the internal 
bandwidth between the card and the server does not need to 
match the total external bandwidth. In this case, the 
number of external 10GE ports can be larger than four. We 
also need to upgrade the PCI-E switching chip to provide 
an upstream link with 40Gb/s bandwidth, which requires 
PCI-E Gen2 x8. Since the signal rate on the board is 10x 
faster than that in the current ServerSwitch, more 
hardware engineering effort will be needed to guarantee the 
Signal Integrity (SI). 

All the chips discussed above are readily available in 
the market. The major cost of such a 10GbE card comes 
from the 10GbE Ethernet switching chip, which has a 
much higher price than the 8xGbE switching chip. For 
example, a chip with 24 10GbE ports may cost about 10x 
that of the current one. The NIC chip and PCI-E switching 
chip cost about 2x~3x as much as the current ones. 
Overall, we expect the 10GE version card to be about 5x more 
expensive than the current 1GE version. 


8 Related Work 


OpenFlow defines an architecture for a central controller 
to manage OpenFlow switches over a secure channel, 
usually via TCP/IP. It defines a specification to manage 
the flow table inside the switches. Both OpenFlow and 
ServerSwitch aim towards a more programmable net- 
working platform. Aiming to provide both programma- 
bility and high performance, ServerSwitch uses multiple 
PCI-E lanes to interconnect the switching chip and the 
server. The low latency and high speed of the channel 
enable us to harness the resources in a commodity server 
to provide both programmable control and data planes. 
With OpenFlow, however, it is hard to achieve similar 
functionality due to the higher latency and lower 
bandwidth between switches and the controller. 

Orphal provides a common API for proprietary 
switching hardware [21], which is similar to our APIs. 
Specifically, they also designed a set of APIs to manage 
the TCAM table. Our work is more than API design. We 
introduce a novel TCAM table based method for index- 
based source routing. We also leverage the resources of 
a commodity server to provide extra programmability. 

Flowstream uses commodity switches to direct traf- 
fic to commodity servers for in-network processing [12]. 
The switch and the server are loosely coupled, i.e., the 
server cannot directly control the switching chip. In 
ServerSwitch, the server and the switching chip are 
tightly coupled, which enables ServerSwitch to provide 
new functions such as software-defined congestion con- 
trol which requires low-latency communication between 
the server and the switching chip. 

Recently, high performance software routers, e.g., 
RouteBricks [10] and PacketShader [16], have been 
designed and implemented. By leveraging multi-cores, 
they can achieve tens of Gb/s throughput. ServerSwitch 
is complementary to these efforts in that ServerSwitch 
tries to offload certain packet forwarding tasks from the 
CPU to a modern switching chip. ServerSwitch also tries 
to optimize its software to process low latency packets 
such as congestion control messages. At present, 
due to hardware limitations, ServerSwitch only provides 
4x1GE ports. RouteBricks or PacketShader can certainly 
leverage a future 10GE ServerSwitch card to provide a 
higher throughput system, with a portion of traffic 
forwarded by the switching chip. 

Commercial switches generally have an embedded 
CPU for switch management. More recently, Arista’s 
7100 series introduces the use of dual-core x86 CPU 
and provides APIs for programmable management plane 
processing. ServerSwitch differs from existing com- 
modity switches in two ways: (1) The CPUs in com- 
modity switches mainly focus on management functions, 
whereas ServerSwitch explores a way to combine the 
switching chip with the most advanced CPUs and server 
architecture. On this platform, the CPUs can process 
forwarding/control/management plane packets with high 
throughput and low latency. The host interface on the 
switching chip usually has limited bandwidth since the 
interface is designed for carrying control/management 
messages. ServerSwitch overcomes this limitation by 
introducing additional NIC chips for a high bandwidth, yet 
low latency channel between the switching chip and the 
server; (2) ServerSwitch tries to provide a common set 
of APIs to program the switch chip. The APIs are de- 
signed to be as universal as possible. Ideally, the API is 
the same no matter what kind of switching chip is used. 

Ripcord [9] mainly focuses on the DCN control plane. 
It currently uses OpenFlow switches as its data plane. 
Our work is orthogonal to their work. We envision that 
they can also use ServerSwitch to support new DCN such 
as BCube, and to support more routing schemes such as 
source routing and tag-based routing. 


9 Conclusion 


We have presented the design and implementation of 
ServerSwitch, a programmable and high performance 
platform for data center networks. ServerSwitch ex- 
plores the design space of integrating a high perfor- 
mance, limited programmable ASIC switching chip with 
a powerful, fully programmable multicore commodity 
server. 

ServerSwitch achieves easy-to-use programmability 
by using the server system to program and control the 
switching chip. The switching chip can be programmed 
to support a flexible packet header format and various 
user defined packet forwarding designs with line-rate 
without the server CPU intervening. By leveraging the 
low latency PCI-E interface and efficient server software 
design, we can implement software defined signaling and 
congestion control in the server CPU with low CPU 
overhead. The rich programmability provided by ServerSwitch 
can further enable new DCN services that need 
in-network data processing such as in-network caching. 

We have built a ServerSwitch card and a whole Server- 
Switch software stack. Our implementation experiences 
demonstrate that ServerSwitch can be fully constructed 
from commodity, inexpensive components. Our develop- 
ment experiences further show that ServerSwitch is easy 
to program, using the standard C/C++ language and de- 
velopment tool chains. We have used our ServerSwitch 
platform to construct several recently proposed DCN de- 
signs, including new DCN architectures BCube and Port- 
Land, congestion control algorithm QCN, and DCN in- 
network caching service. 

Our software API currently focuses on lookup table 
programmability and queue information query. Current 
switching chips also provide advanced features such as 
queue and buffer management, access control, and pri- 
ority and fair queueing scheduling. We plan to extend 
our API to cover these features in our future work. We 
also plan to upgrade the current 1GE hardware to 10GE in 
the next version. We expect that ServerSwitch may be 
used for networking research beyond DCN (e.g., enter- 
prise networking). We plan to release both the Server- 
Switch card and the software package to the networking 
research community in the future. 


Abstract: We present TritonSort, a highly efficient, scalable 
sorting system. It is designed to process large datasets, 
and has been evaluated against as much as 100 TB of input 
data spread across 832 disks in 52 nodes at a rate of 0.916 
TB/min. When evaluated against the annual Indy GraySort 
sorting benchmark, TritonSort is 60% better in absolute 
performance and has over six times the per-node efficiency 
of the previous record holder. In this paper, we describe 
the hardware and software architecture necessary to oper- 
ate TritonSort at this level of efficiency. Through careful 
management of system resources to ensure cross-resource 
balance, we are able to sort data at approximately 80% of 
the disks’ aggregate sequential write speed. 

We believe the work holds a number of lessons for bal- 
anced system design and for scale-out architectures in gen- 
eral. While many interesting systems are able to scale lin- 
early with additional servers, per-server performance can 
lag behind per-server capacity by more than an order of 
magnitude. Bridging the gap between high scalability and 
high performance would enable either significantly cheaper 
systems that are able to do the same work or provide the 
ability to address significantly larger problem sets with the 
same infrastructure. 


1 Introduction 


The need for large-scale computing is increasing, driven 
by search engines, social networks, location-based ser- 
vices, and biological and scientific applications. The 
value of these applications is defined by the quality 
and quantity of data over which they operate, result- 
ing in very high I/O and storage requirements. These 
Data-intensive Scalable Computing systems, or DISC 
systems[8], require searching and sorting large quanti- 
ties of data spread across the network. Sorting forms the 
kernel of many data processing tasks in the datacenter, 
exercises computing, I/O, and storage resources, and is a 
key bottleneck for many large-scale systems. 

Several new DISC software architectures have 
been developed recently, including MapReduce[9], the 
Google File System[11], Hadoop[22], and Dryad[14]. 
These systems are able to scale linearly with the num- 
ber of nodes in the cluster, making it trivial to add new 
processing capability and storage capacity to an existing 
cluster by simply adding more nodes. This linear scala- 


bility is achieved in part by exposing parallel program- 
ming models to the user and by performing computation 
on data locally whenever possible. Hadoop clusters with 
thousands of nodes are now deployed in practice [23]. 


Despite this linear scaling behavior, per-node perfor- 
mance has lagged behind per-server capacity by more 
than an order of magnitude. A survey of several de- 
ployed DISC sorting systems[4] found that the impres- 
sive results obtained by operating at high scale mask a 
typically low individual per-node efficiency, requiring 
a larger-than-needed scale to meet application require- 
ments. For example, among these systems as much as 
94% of available disk I/O and 33% CPU capacity re- 
mained idle[4]. The largest known industrial Hadoop 
clusters achieve only 20 Mbps of average bandwidth for 
large-scale data sorting on machines theoretically capa- 
ble of supporting a factor of 100 more throughput. 


In this work we present TritonSort, a highly efficient 
sorting system designed to sort large volumes of data 
across dozens of nodes. We have applied it to data sets 
as large as 100 terabytes spread across 832 disks in 52 
nodes. The key to TritonSort’s efficiency is its balanced 
software architecture, which is able to effectively make 
use of a large amount of co-located storage per node, en- 
suring that the disks are kept as utilized as possible. Our 
results show the benefit of our design: evaluating Triton- 
Sort against the ‘Indy’ GraySort benchmark[19] resulted 
in a system that was able to sort 100 TB of input tuples
in about 60% of the absolute time of the previous record- 
holder, but with four times fewer resources, resulting in 
an increase in per-node efficiency by over a factor of six. 


It is important to note that our focus in building Tri- 
tonSort is to highlight the efficiency gains that can be 
obtained in building systems that process significant 
amounts of data through balancing computation, stor- 
age, memory, and network. Systems such as Hadoop and 
Dryad further support data-level replication, transparent 
node failure, and a generalized computational model, all 
of which are not currently present in TritonSort. How- 
ever, in presenting TritonSort’s hardware and software 
architecture, we describe several lessons learned in its 
construction that we believe are generalizable to other 
data processing systems. For example, our design relies 
on a very high disk-to-node ratio as well as an explicit,
application-level management of in-memory buffers to 
minimize disk seeks and thus increase read and write 
throughput. We choose buffer sizes to balance time spent 
processing multiple stages of our sort pipeline, and trade 
off the utilization of one resource for another. 

Our experiences show that for a common datacenter 
workload, systems can be built with commodity hard- 
ware and open-source software that improve on per-node 
efficiency by an order of magnitude while still achiev- 
ing scalability. Building such systems will either enable 
significantly cheaper systems to be able to do the same 
work or provide the ability to address significantly larger 
problem sets with the same infrastructure. 

The primary contributions of this paper are: 1) the se- 
lection of a balanced hardware platform tuned to support 
a large-scale sort application, 2) a sort application im- 
plemented on top of a staged, pipeline-oriented software 
runtime that supports performance tuning via selection 
of appropriate buffer sizes and quantities, 3) an examina- 
tion of projected sort performance when bottlenecks are 
removed, and 4) a discussion of the experience gained in 
building and deploying this prototype at scale. 


2 Design Challenges 


In this paper, we focus on designing systems that sort 
large datasets as an instance of the larger problem of 
building balanced systems. Here, we present our precise 
problem formulation, discuss the challenges involved, 
and outline the key insights underlying our approach. 


2.1 Problem Formulation 


We seek to design a system that sorts large volumes of 
input data. Based on the specification of the sort bench- 
mark [19], our input data comprises 100 byte tuples with 
a 10 byte key and 90 byte value. We target deployments 
with input data on the order of tens to hundreds of TB of 
randomly-generated tuples. The input data is stored as 
a collection of files on persistent storage. The goal of a 
sorting system is to transform this input data into an or- 
dered set of output files, also stored on persistent storage, 
such that the concatenation of these output files in order 
constitutes the sorted version of the input data. Our goal 
is to design and implement a sorting system that can sort 
datasets of the targeted size while achieving a favorable 
tradeoff between speed, resource utilization, and cost. 


2.2 The Challenge of Efficient Sorting 


Sorting large datasets places stress on several resources 
in a cluster. First, storing tens to hundreds of TB of input
and output data demands a large amount of storage ca- 
pacity. Given the size of the data and modern commod- 
ity hard drive capacities, the data must be stored across 
several storage devices and almost certainly across many 


NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation 


machines. Second, reading the input data and writing 
the output data across many disks simultaneously places 
load on both storage devices and I/O controllers. Third, 
since the tuples are distributed randomly across the in- 
put files, almost all of the large dataset to be sorted will 
have to be sent over the network. Finally, comparing tu- 
ples in order to sort them requires a non-trivial amount 
of compute power. This combination of demands makes 
designing a sorting system that efficiently utilizes all of 
these resources challenging. 


Our key design principle to ensure good resource uti- 
lization is to construct a balanced system—a system that 
drives all resources at as close to 100% utilization as pos- 
sible. For any given application and workload, there will 
be an ideal configuration of hardware resources in keep- 
ing with the application’s demands on these resources. 
In practice, the set of hardware configurations available 
is limited by the availability of components (one cannot 
currently, for example, buy a processor with exactly 13 
cores), and so a configuration must be chosen that best 
meets the application’s demands. Once that hardware 
configuration is determined, the application must be ar- 
chitected to suitably exploit the full capabilities of the 
deployed hardware. In the following section, we outline 
our considerations in designing such a balanced system, 
including our choice of a specific hardware and software 
architecture. We did not choose this platform with sort- 
ing in mind, and so we believe that our design generalizes 
to other DISC problems as well. 


2.3 Design Considerations 


Our system’s design is motivated by three main considerations.
First, we rely only on commodity hardware components.
This is both to keep the costs of our system relatively
low and to have our system be representative of
today’s data centers so that the lessons we learn can be 
applied to other applications with workload characteris- 
tics similar to those of sort. Hence, we do not make use 
of networking substrates such as Infiniband that provide 
high network bandwidth at high cost. Also, despite the 
recent emergence of solid state drives (SSDs) that pro- 
vide higher I/O rates, we chose to use hard disks because 
they continue to provide the most affordable option for 
high capacity storage and streaming I/O. We believe that 
properly-architected sorting software should not stress 
random I/O behavior, where SSDs currently excel.

Second, we focus our software architecture on mini- 
mizing disk seeks. In the particular hardware configu- 
ration we chose, the key bottleneck for sort among the 
various system resources is disk I/O bandwidth. Hence,
the primary goal of the system is to enable all disks to 
operate continuously at peak bandwidth. The main chal- 
lenge in sustaining peak disk bandwidth is to minimize 
the amount of time the disks spend seeking, since any
time seeking is not spent transferring data. 

Third, we choose to focus on hardware architectures 
whose total memory cannot contain the entire dataset. 
One possible implementation of sort is to read all the 
input data into memory, appropriately shuffle the data 
across machines in the cluster, sort the local in-memory 
data on each machine, and then write the sorted data 
to the local disks. Note that in this case, every tuple is 
read from and written to persistent storage exactly once. 
However, this implementation would require an amount 
of memory at least equal to the amount of input data; 
given that the cost per GB of RAM is over 70 times more 
than that of disks, such a design would significantly drive 
up costs and be infeasible for large input datasets. 

Instead, we pursue an alternative implementation 
wherein every tuple is read and written multiple times 
from disk before the data is completely sorted. Storing 
intermediate results on disk makes the system’s memory 
requirement far more modest. Sorting data on clusters 
that have less memory than the total amount of data to be 
sorted requires every input tuple to be read and written 
at least twice [1]. Since every additional read and write 
increases the time to sort, we seek to achieve exactly this 
lower bound to maximize system performance. 


2.4 Hardware Architecture 


To determine the right hardware configuration for our ap- 
plication, we make the following observations about the 
sort workload. First, the application needs to read ev- 
ery byte of the input data and the size of the input is 
equal to that of the output. Since the “working set” is 
so large, it does not make sense to separate the cluster 
into computation-heavy and storage-heavy regions. In- 
stead, we provision each server in the cluster with an 
equal amount of processing power and disks. 

Second, almost all of the data needs to be exchanged 
between machines since input data is randomly dis- 
tributed throughout the cluster and adjacent tuples in the 
sorted sequence must reside on the same machine. To 
balance the system, we need to ensure that this all-to-all 
shuffling of data can happen in parallel without network 
bandwidth becoming a bottleneck. Since we focus on 
using commodity components, we use an Ethernet net- 
work fabric. Commodity Ethernet is available in a set 
of discrete bandwidth levels—1 Gbps, 10 Gbps, and 40 
Gbps—with cost increasing proportional to throughput 
(see Table 1). Given our choice of 7.2k-RPM disks for 
storage, a 1 Gbps network can accommodate at most one 
disk per server without the network throttling disk I/O. 
Therefore, we settle on a 10 Gbps network; 40 Gbps 
Ethernet has yet to mature and hence is still cost pro- 
hibitive. To balance a 10 Gbps network with disk I/O, 
we use a server that can host 16 disks. Based on the options
available commercially for such a server, we use a
server that hosts 16 disks and 8 CPU cores. The choice of
8 cores was driven by the available processor packaging:
two physical quad-core CPUs. The larger the number
of separate threads, the more stages that can be isolated
from each other. In our experience, the actual speed of
each of these cores was a secondary consideration.

Server configuration        Cost
8 disks, 8 CPU cores        $5,050
8 disks, 16 CPU cores       $5,450
16 disks, 16 CPU cores      $7,550

Table 1: Resource options considered for constructing a
cluster for a balanced sorting system. These values are
estimates as of January, 2010.


Third, sort demands both significant capacity and I/O 
requirements from storage since tens to hundreds of TB 
of data is to be stored and all the data is to be read and 
written twice. To determine the best storage option given 
these requirements, we survey a range of hard disk op- 
tions shown in Table 1. We find that 7.2k-RPM SATA 
disks provide the most cost-effective option in terms of 
balancing $ per GB and $ per read/write MBps (assum- 
ing we can achieve streaming I/O). To allow 16 disks to 
operate at full streaming I/O throughput, we require stor- 
age controllers that are able to sustain at least 1600 MBps 
of streaming bandwidth. Because of the PCI bus’ band- 
width limitations, our hardware design necessitated two 
8x PCI drive controllers, each supporting 8 disks. 
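
The balance argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-disk rate of about 100 MBps is inferred from the stated 1600 MBps requirement for 16 disks, and the helper names are hypothetical.

```python
# Back-of-the-envelope balance check using figures quoted in the text.
DISKS_PER_SERVER = 16
CONTROLLER_MBPS = 1600  # aggregate streaming bandwidth required for 16 disks

# Assumption: every disk streams at the same rate, so each needs about 100 MBps.
PER_DISK_MBPS = CONTROLLER_MBPS / DISKS_PER_SERVER


def link_mbps(gbps: float) -> float:
    """Convert a link rate in Gbps to MBps (8 bits per byte, decimal units)."""
    return gbps * 1000 / 8


def disks_supported(gbps: float) -> int:
    """Number of streaming disks a link can feed without throttling them."""
    return int(link_mbps(gbps) // PER_DISK_MBPS)


print(disks_supported(1))   # disks a 1 Gbps link can keep busy
print(disks_supported(10))  # disks a 10 Gbps link can keep busy
```

Under these assumptions a 1 Gbps link keeps up with a single disk, while a 10 Gbps link can feed roughly a dozen, which is consistent with the choice of a 10 Gbps fabric for a 16-disk server in which only half of the disks are read at any one time.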


The final design choice in provisioning our cluster is 
the amount of memory each server should have. The 
primary purpose of memory in our system is to enable 
large amounts of data buffering so that we can read from 
and write to the disk in large chunks. The larger these 
chunks become, the more data can be read or written be- 
fore seeking is required. We initially provisioned each of 
our machines with 12 GB of memory; however, during 
development we realized that 24 GB was required to pro- 
vide sufficiently large writes, and so the machines were 
upgraded. We discuss this addition when we present our 
architecture in Section 3. One of the key takeaways from
our work is the important role that buffering plays in en- 
abling high utilization of the network, disk, and CPU. 
Determining the appropriate amount of memory buffer- 
ing is not straightforward and we leave to future work 
techniques that help automate this process. 


2.5 Software Architecture 


To maximize cluster resource utilization, we need to de- 
sign an appropriate software architecture. There are a 
range of possible software architectures in keeping with 
our constraint of reading and writing every input tuple at 
most twice. The class of architectures upon which we 
focus share a similar basic structure. These architectures 
consist of two phases separated by a distributed barrier, 
so that all nodes must complete phase one before phase 
two begins. In the first phase, input data is read from disk 
and routed to the node upon which it will ultimately re- 
side. Each node is responsible for storing a disjoint por- 
tion of the key space. When data arrives at its destination 
node, that node writes the data to its local disks. In the 
second phase, each node sorts the data on its local disks 
in parallel. At the end of the second phase, each node has 
a portion of the final sorted sequence stored on its local 
disks, and the sorted sequences stored on all nodes can be 
concatenated together to form the final sorted sequence. 

There are several possible implementations of this 
general architecture, but any implementation contains 
at least a few basic software elements. These software 
elements include Readers that read data from on-disk 
files into in-memory buffers, Writers that write buffers to 
disk, Distributors that distribute a buffer’s tuples across 
a set of logical divisions and Sorters that sort buffers. 

Our initial implementation of TritonSort was designed 
as a distributed parallel external merge-sort. This ar- 
chitecture, which we will call the Heaper-Merger archi- 
tecture, is structured as follows. In phase one, Readers 
read from the input files into buffers, which are sorted 
by Sorters. Each sorted buffer is then passed to a Dis- 
tributor, which splits the buffer into a sorted chunk per 
node and sends each chunk to its corresponding node. 
Once received, these sorted chunks are heap-sorted by 
software elements called Heapers in batches and each 
resulting sorted batch is written to an intermediate file 
on disk. In the second phase, software elements called 
Mergers merge-sort the intermediate files on a given disk 
into a single sorted output file. 

The problem with the Heaper-Merger architecture is 
that it does not scale well. In order to prevent the Heaper 
in phase one from becoming a bottleneck, the length of 
the sorted runs that the Heaper generates is usually fairly 
small, on the order of a few hundred megabytes. As a 
consequence, the number of intermediate files that the 
Merger must merge in phase two grows quickly as the 

Figure 1: Performance of a Heaper-Merger sort implementation
in microbenchmark on a 200GB per disk parallel
external merge-sort as a function of the number of
files merged per disk.


size of the input data increases. This reduces the amount 
of data from each intermediate file that can be buffered at 
a time by the Merger and requires that the Merger fetch
additional data from files much more frequently, causing 
many additional seeks. 


To demonstrate this problem, we implemented a sim- 
ple Heaper-Merger sort module in microbenchmark. We 
chose to sort 200GB per disk in parallel across all the 
disks to simulate the system’s performance during a 
100 TB sort. Each disk’s 200GB data set is partitioned
among an increasingly large number of files. Each node’s 
memory is divided such that each input file and each 
output file can be double-buffered. As shown in Figure 1,
throughput decreases dramatically once the number of
files being merged rises above 1000.


TritonSort uses an alternative architecture with simi- 
lar software elements as above and again involving two 
phases. We partition the input data into a set of logical 
partitions; with D physical disks and L logical partitions,
each logical partition corresponds to a contiguous 1/L
fraction of the key space and each physical disk hosts
L/D logical partitions. In the first phase, Readers pass buffers
directly to Distributors. A Distributor maps the key of 
every tuple in its input buffer to its corresponding logical 
partition and sends that tuple over the network to the ma- 
chine that hosts this logical partition. Tuples for a given 
logical partition are buffered in memory and written to 
disk in large chunks in order to seek as little as possible. 
In the second phase, each logical partition is read into 
an in-memory buffer, that buffer is sorted, and the sorted 
buffer is written to disk. This scheme bypasses the seek 
limits of the earlier mergesort-based approach. Also, by 
appropriately choosing the value of L, we can ensure that 
logical partitions can be read, sorted and written in par- 
allel in the second phase. Since our testbed nodes have 
24GB of RAM, to ensure this condition we set the number
of logical partitions per node to 2520 so that each
logical partition contains less than 1GB of tuples when 
we sort 100 TB on 52 nodes. We explain this architec- 
ture in more detail in the context of our implementation 
in the next section. 
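
The partition-size claim can be verified directly. This is a quick check rather than part of the system; it assumes decimal units (1 TB = 10^12 bytes) and uniformly distributed keys, and the constant names are ours.

```python
# Size of one logical partition when sorting 100 TB on 52 nodes.
TOTAL_BYTES = 100 * 10**12      # 100 TB of input (decimal units assumed)
NODES = 52
PARTITIONS_PER_NODE = 2520

total_partitions = NODES * PARTITIONS_PER_NODE
partition_bytes = TOTAL_BYTES / total_partitions

print(total_partitions)                  # logical partitions in the cluster
print(round(partition_bytes / 10**6))    # approximate partition size in MB
```

With these numbers each logical partition holds roughly 763 MB, comfortably below the 1 GB bound quoted above.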


3 Design and Implementation 


TritonSort is a distributed, staged, pipeline-oriented 
dataflow processing system. In this section, we describe 
TritonSort’s design and motivate our design decisions for 
each stage in its processing pipeline. 


3.1 Architecture Overview 


Figures 2 and 7 show the stages of a TritonSort program. 
Stages in TritonSort are organized in a directed graph 
(with cycles permitted). Each stage in TritonSort im- 
plements part of the data processing pipeline and either 
sources, sinks, or transmutes data flowing through it. 

Each stage is implemented by two types of logical 
entities: several workers and a single WorkerTracker.
Each worker runs in its own thread and maintains its own 
local queue of pending work. We refer to the discrete 
pieces of data over which workers operate as work units 
or simply as work. The WorkerTracker is responsible for 
accepting work for its stage and assigning that work to 
workers by enqueueing the work onto the worker’s work 
queue. In each phase, all the workers for all stages in that 
phase run in parallel. 

Upon starting up, a worker initializes any required in- 
ternal state and then waits for work. When work arrives, 
the worker executes a stage-specific run() method that 
implements the specific function of the stage, handling 
work in one of three ways. First, it can accept an indi- 
vidual work unit, execute the run() method over it, and 
then wait for new work. Second, it can accept a batch of 
work (up to a configurable size) that has been enqueued 
by the WorkerTracker for its stage. Lastly, it can keep its 
run() method active, polling for new work explicitly. Tri- 
tonSort stages implement each of these methods, as de- 
scribed below. In the process of running, a stage can pro- 
duce work for a downstream stage and optionally specify 
the worker to which that work should be directed. If a 
worker does not specify a destination worker, work units 
are assigned to workers round-robin. 
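
The worker and WorkerTracker relationship described above can be sketched as follows. This is a minimal illustration, not TritonSort's implementation: the class layout, the `add_work` and `shutdown` names, and the `None` shutdown sentinel are assumptions made for the example, and only the single-work-unit mode is shown.

```python
import queue
import threading
from typing import Callable, List, Optional


class WorkerTracker:
    """Accepts work for one stage and enqueues it on a worker's private queue."""

    def __init__(self, num_workers: int, run: Callable[[object], None]):
        # One queue and one thread per worker.
        self.queues: List[queue.Queue] = [queue.Queue() for _ in range(num_workers)]
        self._next = 0  # round-robin cursor
        self.threads = [
            threading.Thread(target=self._worker_loop, args=(q, run), daemon=True)
            for q in self.queues
        ]
        for t in self.threads:
            t.start()

    @staticmethod
    def _worker_loop(q: queue.Queue, run: Callable[[object], None]) -> None:
        while True:
            unit = q.get()      # wait for work
            if unit is None:    # hypothetical sentinel: shut the worker down
                break
            run(unit)           # stage-specific logic

    def add_work(self, unit: object, worker: Optional[int] = None) -> int:
        """Enqueue a work unit; pick a worker round-robin unless one is given."""
        if worker is None:
            worker = self._next
            self._next = (self._next + 1) % len(self.queues)
        self.queues[worker].put(unit)
        return worker

    def shutdown(self) -> None:
        for q in self.queues:
            q.put(None)
        for t in self.threads:
            t.join()
```

Note that `add_work` as written is not safe to call from several producer threads at once; a real tracker would need to protect the round-robin cursor.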

In the process of executing its run() method, a worker 
can get buffers from and return buffers to a shared pool 
of buffers. This buffer pool can be shared among the 
workers of a single stage, but is typically shared between 
workers in pairs of stages with the upstream stage getting 
buffers from the pool and the downstream stage putting 
them back. When getting a buffer from a pool, a stage 
can specify whether or not it wants to block waiting for 
a buffer to become available if the pool is empty. 
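
A shared buffer pool with an optional blocking get could look like the sketch below. The `BufferPool` name and its method signatures are hypothetical; the text only specifies that a stage may choose whether to block when the pool is empty.

```python
import queue
from typing import Optional


class BufferPool:
    """Fixed set of reusable buffers shared between an upstream and a downstream stage."""

    def __init__(self, count: int, size: int):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def get(self, block: bool = True) -> Optional[bytearray]:
        """Return a free buffer. If block is False and none is free, return None."""
        try:
            return self._free.get(block=block)
        except queue.Empty:
            return None

    def put(self, buf: bytearray) -> None:
        """Return a buffer to the pool, typically from the downstream stage."""
        self._free.put(buf)
```

Because the pool holds a fixed number of buffers, a blocking `get` also acts as back-pressure: an upstream stage stalls until the downstream stage has returned a buffer.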


3.2 Sort Architecture 


We implement sort in two phases. First, we perform dis- 
tribution sort to partition the input data across L logical 
partitions evenly distributed across all nodes in the clus- 
ter. Each logical partition is stored in its own logical disk. 
All logical disks are of identical maximum size $size_{LD}$
and consist of files on the local file system.

The value of $size_{LD}$ is chosen such that logical disks
from each physical disk can be read, sorted and written
in parallel in the second phase, ensuring maximum resource
utilization. Therefore, if the size of the input data
is $size_{input}$, there are $L = size_{input} / size_{LD}$ logical disks in the
system. In phase two, the tuples in each logical disk get 
sorted locally and written to an output file. This imple- 
mentation satisfies our design goal of reading and writing 
each tuple twice. 

To determine which logical disk holds which tuples, 
we logically partition the 10-byte key space into L even 
divisions. We logically order the logical disks such that 
the $k^{th}$ logical disk holds tuples in the $k^{th}$ division. Sorting
each logical disk produces a collection of output files,
each of which contains sorted tuples in a given partition. 
Hence, the ordered collection of output files represents 
the sorted version of the data. In this paper, we assume 
that tuples’ keys are distributed uniformly over the key 
range, which ensures that each logical disk is approximately
the same size; we discuss how TritonSort can be
made to handle non-uniform key ranges in Section 6.1. 
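
One way to realize the even division of the key space is to treat the 10-byte key as an 80-bit big-endian integer and scale it into the range of logical disk indices. The function below is a sketch under that assumption; the name `logical_disk_for_key` is ours, and the text does not state how the mapping is actually computed.

```python
KEY_BYTES = 10
KEY_SPACE = 1 << (8 * KEY_BYTES)  # 2**80 possible keys


def logical_disk_for_key(key: bytes, num_logical_disks: int) -> int:
    """Map a 10-byte key to the index of the even key-space division containing it."""
    if len(key) != KEY_BYTES:
        raise ValueError("expected a 10-byte key")
    # Big-endian interpretation keeps integer order equal to byte-wise key order,
    # so lower-numbered logical disks always hold smaller keys.
    value = int.from_bytes(key, "big")
    return value * num_logical_disks // KEY_SPACE
```

Because the mapping is monotonic in the key, concatenating the sorted contents of logical disks 0 through L - 1 yields the globally sorted sequence.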

To ensure that we can utilize as much read/write band- 
width as possible on each disk, we partition the disks on 
each node into two groups of 8 disks each. One group 
of disks holds input and output files; we refer to these 
disks as the input disks in phase one and as the output 
disks in phase two. The other group holds intermediate 
files; we refer to these disks as the intermediate disks. In 
phase one, input files are read from the input disks and 
intermediate files are written to the intermediate disks. In 
phase two, intermediate files are read from the intermedi- 
ate disks and output files are written to the output disks. 
Thus, the same disk is never concurrently read from and 
written to, which prevents unnecessary seeking. 


3.3 TritonSort Architecture: Phase One


Phase one of TritonSort, diagrammed in Figure 2, is re- 
sponsible for reading input tuples off of the input disks, 
distributing those tuples over the network to the nodes
on which they belong, and storing them into the logical 
disks in which they belong. 

Reader: Each Reader is assigned an input disk and is 
responsible for reading input data off of that disk. It does 
this by filling 80 MB ProducerBuffers with input data. 
We chose this size because it is large enough to obtain 
near sequential throughput from the disk. 

Figure 2: Block diagram of TritonSort’s phase one architecture. The number of workers for a stage is indicated in the
lower-right corner of that stage’s block, and the number of disks of each type is indicated in the lower-right corner of
that disk’s block.

Figure 3: The NodeDistributor stage, responsible for partitioning
tuples by destination node.


NodeDistributor: A NodeDistributor (shown in Fig- 
ure 3) receives a ProducerBuffer from a Reader and is re- 
sponsible for partitioning the tuples in that buffer across 
the machines in the cluster. It maintains an internal data 
structure called a NodeBuffer table, which is an array of 
NodeBuffers, one for each of the nodes in the cluster. A 
NodeBuffer contains tuples belonging to the same desti- 
nation machine. Its size was chosen to be the size of the 
ProducerBuffer divided by the number of nodes, and is 
approximately 1.6 MB in size for the scales we consider 
in this paper. 

The NodeDistributor scans the ProducerBuffer tuple 
by tuple. For each tuple, it computes a hash function 
H(k) over the tuple’s key k that maps the tuple to a
unique host in the range [0, N - 1]. It uses the NodeBuffer
table to select a NodeBuffer corresponding to host
H(k) and appends the tuple to the end of that buffer. If 
that append operation causes the buffer to become full, 
the NodeDistributor removes the NodeBuffer from the 
NodeBuffer table and sends it downstream to the Sender 
stage. It then gets a new NodeBuffer from the Node- 
Buffer pool and inserts that buffer into the newly empty 
slot in the NodeBuffer table. Once the NodeDistributor 
is finished processing a ProducerBuffer, it returns that 
buffer back to the ProducerBuffer pool. 
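
The NodeDistributor's per-tuple loop can be illustrated as follows. This is a simplified sketch: `host_for_key` stands in for H(k), `emit` stands in for handing a full NodeBuffer to the Sender stage, and a fresh `bytearray` stands in for a buffer drawn from the NodeBuffer pool. All of these names are assumptions made for the example.

```python
from typing import Callable, List

TUPLE_SIZE = 100  # bytes per tuple
KEY_SIZE = 10     # bytes per key


class NodeDistributor:
    def __init__(
        self,
        num_nodes: int,
        node_buffer_capacity: int,
        host_for_key: Callable[[bytes], int],
        emit: Callable[[int, bytes], None],
    ):
        self.capacity = node_buffer_capacity
        self.host_for_key = host_for_key
        self.emit = emit
        # NodeBuffer table: one partially filled buffer per destination node.
        self.table: List[bytearray] = [bytearray() for _ in range(num_nodes)]

    def process(self, producer_buffer: bytes) -> None:
        """Scan a ProducerBuffer tuple by tuple and append each tuple to its NodeBuffer."""
        for off in range(0, len(producer_buffer), TUPLE_SIZE):
            tup = producer_buffer[off:off + TUPLE_SIZE]
            host = self.host_for_key(tup[:KEY_SIZE])
            buf = self.table[host]
            buf += tup  # in-place append to the NodeBuffer for this host
            if len(buf) + TUPLE_SIZE > self.capacity:
                # Buffer is full: pass it downstream and install an empty one.
                self.emit(host, bytes(buf))
                self.table[host] = bytearray()
```

Partially filled NodeBuffers remain in the table after `process` returns, so they continue to accumulate tuples from subsequent ProducerBuffers.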


Sender: The Sender stage (shown in Figure 4) is 
responsible for taking NodeBuffers from the upstream 
NodeDistributor stage and transmitting them over the 
network to each of the other nodes in the cluster. Each 
Sender maintains a separate TCP socket per peer node 

Figure 4: The Sender stage, responsible for sending data
to other nodes.


in the cluster. The Sender stage can be implemented 
in a multi-threaded or a single-threaded manner. In the 
multi-threaded case, N Sender workers are instantiated
in their own threads, one for each destination node. Each 
Sender worker simply issues a blocking send() call on 
each NodeBuffer it receives from the upstream NodeDis- 
tributor stage, sending tuples in the buffer to the appro- 
priate destination node over the socket open to that node. 
When all the tuples in a buffer have been sent, the Node- 
Buffer is returned to its pool, and the next one is pro- 
cessed. For reasons described in Section 4.1, we choose 
a single-threaded Sender implementation instead. Here, 
the Sender interleaves the sending of data across all the 
destination nodes in small non-blocking chunks, so as to 
avoid the overhead of having to activate and deactivate 
individual threads for each send operation to each peer. 


Unlike most other stages, which process a single unit 
of work during each invocation of their run() method, the 
Sender continuously processes NodeBuffers as it runs, 
receiving new work as it becomes available from the 
NodeDistributor stage. This is because the Sender must 
remain active to alternate between two tasks: accepting
incoming NodeBuffers from upstream NodeDistributors,
and sending data from accepted NodeBuffers downstream.
To facilitate accepting incoming NodeBuffers,
each Sender maintains a set of NodeBuffer lists, one for 
each destination host. Initially these lists are empty. The 
Sender appends each NodeBuffer it receives onto the list 
of NodeBuffers corresponding to the incoming Node- 
Buffer’s destination node. 

Figure 5: The Receiver stage, responsible for receiving
data from other nodes’ Sender stages.


To send data across the network, the Sender loops 
through the elements in the set of NodeBuffer lists. If 
the list is non-empty, the Sender accesses the Node- 
Buffer at the head of the list, and sends a fixed-sized 
amount of data to the appropriate destination host using 
a non-blocking send() call. If the call succeeds and some 
amount of data was sent, then the NodeBuffer at the head 
of the list is updated to note the amount of its contents
that have been successfully sent so far. If the send() call 
fails, because the TCP send buffer for that socket is full, 
that buffer is simply skipped and the Sender moves on 
to the next destination host. When all of the data from 
a particular NodeBuffer is successfully sent, the Sender 
returns that buffer back to its pool. 
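
The single-threaded send loop can be sketched with non-blocking sockets as shown below. The chunk size, the `PendingBuffer` class, and the `sender_pass` function are hypothetical; the text says only that a fixed-size amount of data is sent per call and that a destination is skipped when its TCP send buffer is full.

```python
import socket
from collections import deque
from typing import Callable, Deque, Dict

CHUNK = 16 * 1024  # assumed per-call send size; the text only says "fixed-sized"


class PendingBuffer:
    """A NodeBuffer plus a count of how many of its bytes have been sent so far."""

    def __init__(self, data: bytes):
        self.data = memoryview(data)
        self.sent = 0


def sender_pass(
    socks: Dict[int, socket.socket],
    pending: Dict[int, Deque[PendingBuffer]],
    release: Callable[[PendingBuffer], None],
) -> None:
    """One sweep over all destinations, sending at most one chunk to each."""
    for dest, sock in socks.items():
        bufs = pending[dest]
        if not bufs:
            continue  # nothing queued for this destination
        head = bufs[0]
        try:
            n = sock.send(head.data[head.sent:head.sent + CHUNK])
        except BlockingIOError:
            continue  # TCP send buffer full: skip this destination for now
        head.sent += n  # a partial send simply advances the offset
        if head.sent == len(head.data):
            bufs.popleft()
            release(head)  # return the fully sent buffer to its pool
```

The sockets passed in `socks` are assumed to have been put into non-blocking mode with `setblocking(False)`. Calling `sender_pass` repeatedly interleaves progress across all destinations without ever stalling on a slow peer.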


Receiver: The Receiver stage, shown in Figure 5, 
is responsible for receiving data from other nodes in 
the cluster, appending that data onto a set of Node- 
Buffers, and passing those NodeBuffers downstream to 
the LogicalDiskDistributor stage. In TritonSort, the Re- 
ceiver stage is instantiated with a single worker. On 
starting up, the Receiver opens a server socket and ac- 
cepts incoming connections from Sender workers on re- 
mote nodes. Its run() method begins by getting a set of 
NodeBuffers from a pool of such buffers, one for each 
source node. The Receiver then loops through each of 
the open sockets, reading up to 16KB of data at a time 
into the NodeBuffer for that source node using a non- 
blocking recv() call. This small socket read size is due 
to the rate-limiting fix that we explain in Section 4.1. If 
data is returned by that call, it is appended to the end 
of the NodeBuffer. If the append would exceed the size 
of the NodeBuffer, that buffer is sent downstream to the 
LogicalDiskDistributor stage, and a new NodeBuffer is 
retrieved from the pool to replace the NodeBuffer that 
was sent. 


LogicalDiskDistributor: The LogicalDiskDistributor
stage, shown in Figure 6, receives NodeBuffers
from the Receiver that contain tuples destined
for logical disks on its node. LogicalDiskDistributors 
are responsible for distributing tuples to appropriate 
logical disks and sending groups of tuples destined for 
the same logical disk to the downstream Writer stage. 

Figure 6: The LogicalDiskDistributor stage, responsible
for distributing tuples across logical disks and buffering
sufficient data to allow for large writes.


The LogicalDiskDistributor’s design is driven by the 
need to buffer enough data to issue large writes and 
thereby minimize disk seeks and achieve high band- 
width. Internal to the LogicalDiskDistributor are two 
data structures: an array of LDBuffers, one per logical 
disk, and an LDBufferTable. An LDBuffer is a buffer 
of tuples destined to the same logical disk. Each LD- 
Buffer is 12,800 bytes long, which is the least common 
multiple of the tuple size (100 bytes) and the direct I/O 
write size (512 bytes). The LDBufferTable is an array 
of LDBuffer lists, one list per logical disk. Additionally, 
LogicalDiskDistributor maintains a pool of LDBuffers, 
containing 1.25 million LDBuffers, accounting for 20 of 
each machine’s 24 GB of memory. 
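
The LDBuffer size follows from a least-common-multiple calculation, which the short check below reproduces. The helper and constant names are ours.

```python
from math import gcd

TUPLE_SIZE = 100            # bytes per tuple
DIRECT_IO_WRITE_SIZE = 512  # bytes per direct I/O write unit


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


LD_BUFFER_SIZE = lcm(TUPLE_SIZE, DIRECT_IO_WRITE_SIZE)

print(LD_BUFFER_SIZE)                          # bytes per LDBuffer
print(LD_BUFFER_SIZE // TUPLE_SIZE)            # whole tuples per LDBuffer
print(LD_BUFFER_SIZE // DIRECT_IO_WRITE_SIZE)  # whole write units per LDBuffer
```

An LDBuffer of 12,800 bytes therefore holds exactly 128 tuples and spans exactly 25 write units, so a full buffer never splits a tuple and never requires padding for direct I/O.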


Algorithm 1 The LogicalDiskDistributor stage
1: NodeBuffer ← getNewWork()
2: {Drain NodeBuffer into the LDBufferArray}
3: for all tuples t in NodeBuffer do
4:   dst = H(key(t))
5:   LDBufferArray[dst].append(t)
6:   if LDBufferArray[dst].isFull() then
7:     LDTable.insert(LDBufferArray[dst])
8:     LDBufferArray[dst] = getEmptyLDBuffer()
9:   end if
10: end for
11: {Send full LDBufferLists to the Coalescer}
12: for all physical disks d do
13:   while LDTable.sizeOfLongestList(d) > 5MB do
14:     ld ← LDTable.getLongestList(d)
15:     Coalescer.pushNewWork(ld)
16:   end while
17: end for


The operation of a LogicalDiskDistributor worker is 
described in Algorithm 1. In Line 1, a full NodeBuffer
is pushed to the LogicalDiskDistributor by the Receiver. 
Lines 3-10 are responsible for draining that NodeBuffer
tuple by tuple into an array of LDBuffers, indexed by the 
logical disk to which the tuple belongs. Lines 12-17 ex- 
amine the LDBufferTable, looking for logical disk lists 
that have accumulated enough data to write out to disk. 
We buffer at least 5 MB of data for each logical disk 
before flushing that data to disk to prevent many small 
write requests from being issued if the pipeline temporar- 
ily stalls. When the minimum threshold of 5 MB is met 
for any particular physical disk, the longest LDBuffer list 
for that disk is passed to the Coalescer stage on Line 15. 
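
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TritonSort implementation: the class and method names, the `push_to_coalescer` callback, the 10-byte key prefix, the modulo stand-in for H, and the mapping of contiguous logical-disk ranges to physical disks are all hypothetical.

```python
from collections import defaultdict

TUPLE_SIZE = 100            # bytes per tuple (from the text)
KEY_SIZE = 10               # leading key bytes per tuple (assumed)
TUPLES_PER_LD_BUFFER = 128  # 12,800-byte LDBuffer / 100-byte tuples


class LogicalDiskDistributor:
    """Illustrative stand-in for one LogicalDiskDistributor worker."""

    def __init__(self, num_logical_disks, logical_per_physical,
                 push_to_coalescer, flush_threshold=5 * 10**6):
        self.num_logical_disks = num_logical_disks
        self.logical_per_physical = logical_per_physical
        self.push_to_coalescer = push_to_coalescer  # hypothetical callback(ld, buffers)
        self.flush_threshold = flush_threshold      # bytes; 5 MB in the text
        self.ld_buffers = [[] for _ in range(num_logical_disks)]  # LDBuffer array
        self.ld_table = defaultdict(list)           # logical disk -> full LDBuffers

    def logical_disk_for(self, tup):
        # Placeholder for H(key(t)); the real hash function is not given here.
        return int.from_bytes(tup[:KEY_SIZE], "big") % self.num_logical_disks

    def list_bytes(self, ld):
        return len(self.ld_table[ld]) * TUPLES_PER_LD_BUFFER * TUPLE_SIZE

    def process(self, node_buffer):
        # Lines 3-10: drain the NodeBuffer tuple by tuple.
        for tup in node_buffer:
            dst = self.logical_disk_for(tup)
            self.ld_buffers[dst].append(tup)
            if len(self.ld_buffers[dst]) == TUPLES_PER_LD_BUFFER:
                self.ld_table[dst].append(self.ld_buffers[dst])
                self.ld_buffers[dst] = []  # stands in for getEmptyLDBuffer()
        # Lines 12-17: per physical disk, hand off the longest list
        # while it exceeds the flush threshold.
        for first in range(0, self.num_logical_disks, self.logical_per_physical):
            last = min(first + self.logical_per_physical, self.num_logical_disks)
            disks = range(first, last)
            while True:
                longest = max(disks, key=self.list_bytes)
                if self.list_bytes(longest) <= self.flush_threshold:
                    break
                self.push_to_coalescer(longest, self.ld_table.pop(longest))
```

In this sketch a partially filled LDBuffer stays in the array until it fills, so only whole LDBuffers ever reach the table, which mirrors Lines 6-8.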


The original design of the LogicalDiskDistributor only 
used the LDBuffer array described above and used much 
larger LDBuffers (~10MB each) rather than many small 
LDBuffers. The Coalescer stage (described below) did 
not exist; instead, the LogicalDiskDistributor transferred 
the larger LDBuffers directly to the Writer stage. 


This design was abandoned due to its inefficient use 
of memory. Temporary imbalances in input distribution 
could cause LDBuffers for different logical disks to fill at 
different rates. This, in turn, could cause an LDBuffer to 
become full when many other LDBuffers in the array are 
only partially full. If an LDBuffer is not available to re- 
place the full buffer, the system must block (either imme- 
diately or when an input tuple is destined for that buffer’s 
logical disk) until an LDBuffer becomes available. One 
obvious solution to this problem is to allow partially full 
LDBuffers to be sent to the Writers at the cost of lower 
Writer throughput. This scheme introduced the further 
problem that the unused portions of the LDBuffers wait- 
ing to be written could not be used by the LogicalDisk- 
Distributor. In an effort to reduce the amount of memory 
wasted in this way, we migrated to the current architec- 
ture, which allows small LDBuffers to be dynamically 
reallocated to different logical disks as the need arises. 
This comes at the cost of additional computational over- 
head and memory copies, but we deem this cost to be 
acceptable due to the small cost of memory copies rela- 
tive to disk seeks. 


Coalescer: The operation of the Coalescer stage is 
simple. A Coalescer will copy tuples from each LD- 
Buffer in its input LDBuffer list into a WriterBuffer and 
pass that WriterBuffer to the Writer stage. It then returns 
the LDBuffers in the list to the LDBuffer pool. 


Originally, the LogicalDiskDistributor stage did the 
work of the Coalescer stage. While optimizing the sys- 
tem, however, we realized that the non-trivial amount of 
time spent merging LDBuffers into a single WriterBuffer 
could be better spent processing additional NodeBuffers. 


Writer: The operation of the Writer stage is also quite 
simple. When a Coalescer pushes a WriterBuffer to it, 
the Writer worker will determine the logical disk 
corresponding to that WriterBuffer and write out the data 
using a blocking write() system call. When the write 
completes, the WriterBuffer is returned to the pool. 


Figure 7: Block diagram of TritonSort’s phase two 
architecture. The number of workers for a stage is indicated in 
the lower-right corner of that stage’s block, and the 
number of disks of each type is indicated in the lower-right 
corner of that disk’s block. 


3.4 TritonSort Architecture: Phase Two 


Once phase one completes, all of the tuples from the in- 
put dataset are stored in appropriate logical disks across 
the cluster’s intermediate disks. In phase two, each of 
these unsorted logical disks is read into memory, sorted, 
and written out to an output disk. The pipeline is straight- 
forward: Reader and Writer workers issue sequential, 
streaming I/O requests to the appropriate disk, and Sorter 
workers operate entirely in memory. 

Reader: The phase two Reader stage is identical to 
the phase one Reader stage, except that it reads into a 
PhaseTwoBuffer, which is the size of a logical disk. 

Sorter: The Sorter stage performs an in-memory sort 
on a PhaseTwoBuffer. A variety of sort algorithms can 
be used to implement this stage, however we selected the 
use of radix sort due to its speed. Radix sort requires ad- 
ditional memory overhead compared to an in-place sort 
like QuickSort, and so the sizes of our logical disks have 
to be sized appropriately so that enough Reader—Sorter— 
Writer pipelines can operate in parallel. Our version 
of radix sort first scans the buffer, constructing a set of 
structures containing a pointer to each tuple’s key and 
a pointer to the tuple itself. These structures are then 
sorted by key. Once the structures have been sorted, they 
are used to rearrange the tuples in the buffer in-place. 
This reduces the memory overhead for each Sorter sub- 
stantially at the cost of additional memory copies. 
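
The Sorter's approach of sorting lightweight per-tuple references and then permuting the buffer can be sketched as follows. This is an illustrative Python sketch, not the authors' code: integer indices stand in for the key and tuple pointers, the byte-wise least-significant-digit radix sort and the cycle-following permutation are our assumptions about details the text does not give, and the 10-byte key length is also assumed.

```python
TUPLE_SIZE = 100   # bytes per tuple (from the text)
KEY_SIZE = 10      # leading key bytes (assumed)


def sorted_order(buf):
    """Return tuple indices in key order using a byte-wise LSD radix sort."""
    order = list(range(len(buf) // TUPLE_SIZE))
    for pos in range(KEY_SIZE - 1, -1, -1):   # least significant key byte first
        buckets = [[] for _ in range(256)]
        for i in order:                       # stable distribution pass
            buckets[buf[i * TUPLE_SIZE + pos]].append(i)
        order = [i for bucket in buckets for i in bucket]
    return order


def rearrange_in_place(buf, order):
    """Move tuple order[p] into slot p for every p, using one tuple of scratch."""
    moved = [False] * len(order)
    for start in range(len(order)):
        if moved[start]:
            continue
        saved = bytes(buf[start * TUPLE_SIZE:(start + 1) * TUPLE_SIZE])
        p = start
        while order[p] != start:              # walk one permutation cycle
            src = order[p]
            buf[p * TUPLE_SIZE:(p + 1) * TUPLE_SIZE] = \
                buf[src * TUPLE_SIZE:(src + 1) * TUPLE_SIZE]
            moved[p] = True
            p = src
        buf[p * TUPLE_SIZE:(p + 1) * TUPLE_SIZE] = saved
        moved[p] = True


def sort_buffer(buf):
    """Sort a bytearray of fixed-size tuples by key, in place."""
    rearrange_in_place(buf, sorted_order(buf))
```

Only the small index list is reordered during the radix passes; each 100-byte tuple is copied once in the final permutation step, which is the trade-off between memory overhead and extra copies that the paragraph describes.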

Writer: The phase two Writer writes a PhaseTwo- 
Buffer sequentially to a file on an output disk. As in 
phase one, each Writer is responsible for writes to a sin- 
gle output disk. 

Because the phase two pipeline operates at the granu- 
larity of a logical disk, we can operate several of these 
pipelines in parallel, limited by either the number of 
cores in each system (we can’t have more pipelines than 
cores without sacrificing performance because the Sorter 
is CPU-bound), the amount of memory in the system 
(each pipeline requires at least three times the size of a 
logical disk to be able to read, sort, and write in parallel), 
or the throughput of the disks. In our case, the limiting 
factor is the output disk bandwidth. To host one phase 
two pipeline per input disk requires storing 24 logical 
disks in memory at a time. To accomplish this, we set 
size_LD to 850 MB, using most of the 24 GB of RAM 
available on each node and allowing for additional 
memory required by the operating system. To sort 850 MB 
logical disks fast enough to not block the Reader and 
Writer stages, we find that four Sorters suffice. 
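
The memory budget behind the 850 MB choice can be worked through as follows. This is a back-of-the-envelope sketch with hypothetical constant names; it assumes eight phase two input disks per node (the eight disks read per node in Section 4.1) and three logical-disk-sized buffers per pipeline, which is how we read the figure of 24 logical disks.

```python
INPUT_DISKS = 8            # phase two input disks per node (assumed from Section 4.1)
BUFFERS_PER_PIPELINE = 3   # one each for reading, sorting, and writing
LOGICAL_DISK_MB = 850      # logical disk size (from the text)
NODE_RAM_GB = 24           # RAM per node (from the text)

logical_disks_in_memory = INPUT_DISKS * BUFFERS_PER_PIPELINE
buffer_memory_gb = logical_disks_in_memory * LOGICAL_DISK_MB / 1000
headroom_gb = NODE_RAM_GB - buffer_memory_gb

print(logical_disks_in_memory, buffer_memory_gb, round(headroom_gb, 1))
```

Under these assumptions the phase two buffers take about 20.4 GB, leaving a few gigabytes for the operating system and other structures.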


3.5 Stage and Buffer Sizing 


One of the major requirements for operating TritonSort 
at near disk speed is ensuring cross-stage balance. Each 
stage has an intrinsic execution time, either based on the 
speed of the device to which it interfaces (e.g., disks or 
network links), or based on the amount of CPU time it re- 
quires to process a work unit. Figure 8 shows the speed 
and performance of each stage in the pipeline. In our im- 
plementation, we are limited by the speed of the Writer 
stage in both phases one and two. 


4 Optimizations 


In implementing the TritonSort architecture, we learned 
that several non-obvious optimizations were necessary to 
meet our desired goal of driving every disk at full utiliza- 
tion. Here, we present the key takeaways from our expe- 
rience. In each case, we believe these lessons generalize 
to a wide variety of DISC systems. 


4.1 Network 


For TritonSort to operate at the aggregate sequential 
streaming bandwidth of all of its disks, the network must 
be able to sustain the read throughput of eight disks while 
data is being shuffled among nodes in the first phase. 
Since the 7.2K-RPM disks we use deliver at most 100 
MBps of sequential read throughput (Table 1), the 
network must be able to sustain 6.4 Gbps of all-pairs 
bandwidth, irrespective of the number of nodes in the cluster. 

It is well-known that sustaining high-bandwidth flows 
in datacenter networks, especially all-to-all patterns, is a 
significant challenge. Reasons for this include commod- 
ity datacenter network hardware, incast, queue buildup, 
and buffer pressure[2]. Since we could not employ a 
strategy like that presented in [2] to provide fair but high 
bandwidth flow rates among the senders, we chose in- 
stead to artificially rate limit each flow at the Sender 
stage to its calculated fair share by forcing the sockets 
to be receive-window limited. This works for TritonSort 
because 1) each machine sends and receives at 
approximately the same rate, 2) all the nodes share the same 
RTT since they are interconnected by a single switch, 
and 3) our switch does not impose an oversubscription 
factor. In this case, each Sender should ideally send at 
a rate of (6.4/N) Gbps, or 123 Mbps with a cluster of 
52 nodes. Given that our network offers approximately 
100 μs RTTs, a receive window size of 8-16 KB 
ensures that the flows will not impose queue buildup or 
buffer pressure on other flows. 

Initially, we chose a straightforward multi-threaded 
design for the Sender and Receiver stages in which there 
were N Senders and N Receivers, one for each 
TritonSort node. In this design, each Sender issues 
blocking send() calls on a NodeBuffer until it is sent. 
Likewise, on the destination node, each Receiver repeatedly 
issues blocking recv() calls until a NodeBuffer has been 
received. Because the number of CPU hyperthreads on 
each of our nodes is typically much smaller than 2N, we 
pinned all Senders’ threads to a single hyperthread and 
all Receivers’ threads to a single separate hyperthread. 

Figure 9 shows that this multi-threaded approach does 
not scale well with the number of nodes, dropping below 
4 Gbps at scale. This poor performance is due to thread 
scheduling overheads at the end hosts. The 16 KB TCP 
receive buffers fill up much faster than the buffers of 
connections that are not window-limited. At the rate of 
123 Mbps, a 16 KB buffer will fill up in just over 1 ms, 
causing the Sender to stop sending. Thus, the Receiver 
stage must clear out each of its buffers at that rate. Since 
there are 52 such buffers, a Receiver must visit and clear 
a receive buffer in just over 20 μs. A Receiver worker 
thread cannot drain the socket, block, go to sleep, and get 
woken up again fast enough to service buffers at this rate. 
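
The rates and time budgets quoted in this section follow from a short calculation, reproduced below as a Python sketch. The variable names are ours; the inputs (eight disks at 100 MBps, 52 nodes, a 16 KB receive buffer) come from the text, and decimal megabits and binary kilobytes are assumed.

```python
DISKS_READ_PER_NODE = 8
DISK_READ_MBYTES_PER_S = 100
NODES = 52
RECV_BUFFER_BYTES = 16 * 1024

# Aggregate read rate of one node, converted from MB/s to Gbps.
per_node_gbps = DISKS_READ_PER_NODE * DISK_READ_MBYTES_PER_S * 8 / 1000

# Fair share of that rate for each of the N flows leaving a node.
per_flow_mbps = per_node_gbps * 1000 / NODES

# Time for one flow to fill a 16 KB receive buffer at its fair share.
fill_time_ms = RECV_BUFFER_BYTES * 8 / (per_flow_mbps * 1e6) * 1e3

# A single Receiver thread must service all N buffers within that time.
per_buffer_budget_us = fill_time_ms * 1000 / NODES

print(per_node_gbps, round(per_flow_mbps), round(fill_time_ms, 2),
      round(per_buffer_budget_us, 1))
```

The result is a budget of roughly 20 μs per socket, which is far shorter than a block, sleep, and wake-up cycle of a thread.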

To circumvent this problem we implemented a single- 
threaded, non-blocking receiver that scans through each 
socket in round-robin order, copying out any available 
data and storing it in a NodeBuffer during each pass 
through the array of open sockets. This implementation 
is able to clear each socket’s receiver buffer faster than 
the arrival rate of incoming data. Figure 9 shows that 
this design scales well as the cluster grows. 
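
A single pass of such a receiver might look like the Python sketch below. It is an illustration only: the function name, the per-socket `bytearray` standing in for a NodeBuffer, and the 16 KB read size are assumptions, and every socket is expected to have been put in non-blocking mode beforehand with `setblocking(False)`.

```python
import socket


def drain_round_robin(socks, node_buffers, chunk=16 * 1024):
    """Visit every socket once, copying out whatever data is already available.

    Returns the number of bytes copied during this pass.
    """
    total = 0
    for i, sock in enumerate(socks):
        try:
            data = sock.recv(chunk)      # returns immediately on a non-blocking socket
        except BlockingIOError:
            continue                     # nothing pending here; move to the next socket
        if data:
            node_buffers[i].extend(data)
            total += len(data)
    return total
```

Calling this function in a loop from one thread avoids the sleep and wake-up cost per socket, at the price of spinning when no data is pending.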

Figure 8: Median stage runtimes for a 52-node, 100TB sort, excluding the amount of time spent waiting for buffers. 

Figure 10: Microbenchmark indicating the ideal disk 
throughput as a function of write size 


4.2 Minimizing Disk Seeks 


Key to making the TritonSort pipeline efficient is min- 
imizing the total amount of time spent performing disk 
seeks, both while writing data in phase one, and while 
reading that data in phase two. As individual write sizes 
get smaller, the throughput drops, since the disk must oc- 
casionally seek between individual write operations. Fig- 
ure 10 shows disk write throughput measured by a syn- 
thetic workload generator writing to a configurable set of 
files with different write sizes. Ideally, the Writer would 
receive WriterBuffers large enough that it can write them 
out at close to the sequential rate of the disk, e.g., 80 
MB. However, the amount of available memory limits 
TritonSort’s write sizes. Since the tuple space is uni- 
formly distributed across the logical disks, the Logical- 
DiskDistributor will fill its LDBuffers at approximately 
a uniform rate. Buffering 80 MB worth of tuples for a 
given logical disk before writing to disk would cause the 
buffers associated with all of the other logical disks to 
become approximately as full. This would mandate sig- 
nificantly higher memory needs than what is available 
in our hardware architecture. Hence, the LogicalDisk- 
Distributor stage must emit smaller WriterBuffers, and it 
must interleave writes to different logical disks. 


4.3 The Importance of File Layout 


The physical layout of individual logical disk files plays a 
strong role in trading off performance between the phase 
one Writer and the phase two Reader. One strategy is to 
append to the logical disk files in a log-structured man- 
ner, in which a WriterBuffer for one logical disk is im- 
mediately appended after the WriterBuffer for a different 
logical disk. This is possible if the logical disks’ blocks 
are allocated on demand. It has the advantage of mak- 
ing the phase one Writer highly performant, since it min- 
imizes seeks and leads to near-sequential write perfor- 
mance. On the other hand, when a phase two Reader 
begins reading a particular logical disk, the underlying 
physical disk will need to seek frequently to read each of 
the WriterBuffers making up the logical disk. 

An alternative approach is to greedily allocate all of 
the blocks for each of the logical disks at start time, en- 
suring that all of a logical disk’s blocks are physically 
contiguous on the underlying disk. This can be accom- 
plished with the fallocate() system call, which provides 
a hint to the file system to pre-allocate blocks. In this 
scheme, interleaved writes of WriterBuffers for different 
logical disks will require seeking since two subsequent 
writes to different logical disks will need to write to dif- 
ferent contiguous regions on the disk. However, in phase 
two, the Reader will be able to sequentially read an en- 
tire logical disk with minimal seeking. We also use fallo- 
cate() on input and output files so that phase one Readers 
and phase two Writers seek as little as possible. 

The location of output files on the output disks also 
has a dramatic effect on phase two’s performance. If we 
do not delete the input files before starting phase two, the 
output files are allocated space on the interior cylinders 
of the disk. When evaluating phase two’s performance on 
a 100 TB sort, we found that we could write to the inte- 
rior cylinders of the disk at an average rate of 64 MBps. 
When we deleted the input files before phase two began, 
ensuring that the output files would be written to the 
exterior cylinders of the disk, this rate jumped to 84 MBps. 
For the evaluations in Section 5, we delete the input files 
before starting phase two. For reference, the fastest we 
have been able to write to the disks in microbenchmark 
has been approximately 90 MBps. 


4.4 CPU Scheduling 


Modern operating systems support a wide variety of 
static and dynamic CPU scheduling approaches, and 
there has been considerable research into scheduling dis- 
ciplines for data processing systems. We put a significant 
amount of effort into isolating stages from one another by 
setting the processor affinities of worker threads explic- 
itly, but we eventually discovered that using the default 
Linux scheduler results in a steady-state performance 
that is only about 5% worse than any custom scheduling 
policy we devised. In our evaluation, we use our custom 
scheduling policy unless otherwise specified. 


4.5 Pipeline Demand Feedback 


Initially, TritonSort was entirely “push”-based, meaning 
that a worker only processed work when it was pushed 
to it from a preceding stage. While this approach is simple 
to design, certain stages perform sub-optimally when they 
are unable to send feedback back in the pipeline as to what 
work they are capable of doing. For example, the throughput 
of the Writer stage in phase one is limited by the latency 
of writes to the intermediate disks, which is governed by 
the sizes of WriterBuffers sent to it as well as the physical 
layout of logical disks (due to the effects of seek and ro- 
tational delay). In its naive implementation, the Logical- 
DiskDistributor sends work to the Writer stage based on 
which of its LDBuffer lists is longest with no regard to 
how lightly or heavily loaded the Writers themselves are. 
This can result in an imbalance of work across Writers, 
with some Writers idle and others struggling to process a 
long queue of work. This imbalance can destabilize the 
whole pipeline and lower total throughput. 

To address this problem, we must effectively com- 
municate information about the sizes of Writers’ work 
queues to upstream stages. We do this by creating a pool 
of write tokens. Every write token is assigned a single 
“parent” Writer. We assign parent Writers in round-robin 
order to tokens as the tokens are created and create a 
number of tokens equal to the number of WriterBuffers. 
When the LogicalDiskDistributor has buffered enough 
LDBuffers so that one or more of its logical disks is 
above the minimum write threshold (5MB), the 
LogicalDiskDistributor will query the write token pool, passing 
it a set of Writers for which it has enough data. If a write 
token is available for one of the specified Writers in the 
set, the pool will return that token, otherwise it will signal 
that no tokens are available. The LogicalDiskDistributor 
is required to pass a token for the target Writer along with 
its LDBuffer list to the next stage. This simple 
mechanism prevents any Writer’s work queue from growing 
longer than its “fair share” of the available WriterBuffers 
and provides reverse feedback in the pipeline without 
adding any new architectural features. 
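
The write token mechanism can be sketched as a small pool keyed by parent Writer. The Python below is a hypothetical illustration, not TritonSort's implementation: the class name, the `try_get` and `put` methods, and the representation of a token as a `(token_id, parent_writer)` pair are all our assumptions.

```python
from collections import deque


class WriteTokenPool:
    """Pool of write tokens, each permanently tied to one parent Writer."""

    def __init__(self, num_writers, num_writer_buffers):
        self.free = {w: deque() for w in range(num_writers)}
        # One token per WriterBuffer, with parents assigned round-robin.
        for token_id in range(num_writer_buffers):
            parent = token_id % num_writers
            self.free[parent].append((token_id, parent))

    def try_get(self, candidate_writers):
        """Return a token for one of the candidate Writers, or None if none is free."""
        for writer in candidate_writers:
            if self.free[writer]:
                return self.free[writer].popleft()
        return None

    def put(self, token):
        """Return a token to the pool once its write has completed."""
        self.free[token[1]].append(token)
```

Because each Writer owns a fixed number of tokens, the number of outstanding requests queued for it can never exceed its share of the WriterBuffers, which is the back-pressure effect described above.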


4.6 System Call Behavior 


In the construction of any large system, there are always 
idiosyncrasies in performance that must be identified and 
corrected. For example, we noticed that the sizes of 
arguments to Linux write() system calls had a dramatic 
impact on their latency; issuing many small writes per 
buffer often yielded better performance than issuing a 
single large write. One would imagine that providing more 
information about the application’s intended behavior to 
the operating system would result in better management 
of underlying resources and latency, but in this case the 
opposite seems to be true. While we are still unsure of 
the cause of this behavior, it illustrates that the perfor- 
mance characteristics of operating system services can 
be unpredictable and counter-intuitive. 


5 Evaluation 


We now evaluate TritonSort’s performance and scalabil- 
ity under various hardware configurations. 


5.1 Evaluation Environment 


We evaluated TritonSort on a 52-node cluster of HP 
DL380G6 servers, each with two Intel E5520 CPUs 
(2.27 GHz), 24 GB of memory, and sixteen 500GB 7,200 
RPM 2.5-inch SATA drives. Each hard drive is configured 
with a single XFS partition. Each XFS partition is con- 
figured with a single allocation group to prevent file frag- 
mentation across allocation groups, and is mounted with 
the noatime, attr2, nobarrier, and noquota 
flags set. Each server has two HP P410 drive controllers 
with 512MB on-board cache, as well as a Myricom 10 
Gbps network interface. The network interconnect we 
use is a 52-port Cisco Nexus 5020 datacenter switch. The 
servers run Linux 2.6.35.1, and our implementation of 
TritonSort is written in C++. 


5.2 Comparison to Alternatives 


The 100TB Indy GraySort benchmark was introduced in 
2009, and hence there are few systems against which we 
can compare TritonSort’s performance. The most recent 
holder of the Indy GraySort benchmark, DEMSort [18], 
sorted slightly over 100TB of data on 195 nodes at a rate 
of 564 GB per minute. TritonSort currently sorts 100TB 
of data on 52 nodes at a rate of 916 GB per minute, a 
factor of six improvement in per-node efficiency. 

Intermediate Disk Speed (RPM) | Logical Disks Per Physical Disk | Phase 1 Throughput (MBps) | Phase 1 Bottleneck Stage | Average Write Size (MB) 
Phase 1 Bottleneck Stage, in row order: Writer | Writer | LogicalDiskDistributor 

Table 2: Effect of increasing speed of intermediate disks on a two node, 500GB sort 


5.3 Examining Changes in Balance 


We next examine the effect of changing the cluster’s con- 
figuration to support more memory or faster disks. Due 
to budgetary constraints, we could not evaluate these 
hardware configurations at scale. Evaluating the perfor- 
mance benefits of SSDs is the subject of future work. 

In the first experiment, we replaced the 500GB, 
7200RPM disks that are used as the intermediate disks in 
phase one and the input disks in phase two with 146GB, 
15000RPM disks. The reduced capacity of the drives 
necessitated running an experiment with a smaller input 
data set. To allow space for the logical disks to be pre- 
allocated on the intermediate disks without overrunning 
the disks’ capacity, we decreased the number of logical 
disks per physical disk by a factor of two. This doubles 
the amount of data in each logical disk, but the experi- 
ment’s input data set is small enough that the amount of 
data per logical disk does not overflow the logical disk’s 
maximum size. 

Phase one throughput in these experiments is slightly 
lower than in subsequent experiments because the 30-35 
seconds it takes to write the last few bytes of each logical 
disk at the end of the phase is roughly 10% of the total 
runtime due to the relatively small dataset size. 

The results of this experiment are shown in Table 2. 
We first examine the effect of decreasing the number of 
logical disks without increasing disk speed. Decreas- 
ing the number of logical disks increases the average 
length of LDBuffer chains formed by the LogicalDisk- 
Distributor; note that most of the time, full WriterBuffers 
(14MB) are written to the disks. In addition, halving the 
number of logical disks decreases the number of external 
cylinders that the logical disks occupy, decreasing maxi- 
mal seek latency. These two factors combine together to 
net a significant (11%) increase in phase one throughput. 

The performance gained by writing to 15000 RPM 
disks in phase one is much less pronounced. The main 
reason for this is that the increase in write speed causes 
the Writers to become fast enough that the Logical- 
DiskDistributor exposes itself as the bottleneck stage. 
One side-effect of this is that the LogicalDiskDistributor 
cannot populate WriterBuffers as fast as they become 
available, so it reverts to a pathological case in which 
it always is able to successfully retrieve a write token 
and hence continuously writes minimally-filled (5MB) 
buffers. Creating a LogicalDiskDistributor stage that 
dynamically adjusts its write size based on write token 
retrieval success rate is the subject of future work. 

RAM Per Node (GB) | Phase 1 Throughput (MBps) | Average Write Size (MB) 

Table 3: Effect of increasing the amount of memory per 
node on a two node, 2TB sort 

Figure 11: Throughput when sorting 1 TB per node as 
the number of nodes increases 

In the next experiment, we doubled the RAM in two 
of the machines in our cluster and adjusted TritonSort’s 
memory allocation by doubling the size of each Writer- 
Buffer (from 14MB to 28MB) and using the remain- 
ing memory (22GB) to create additional LDBuffers. As 
shown in Table 3, increasing the amount of memory al- 
lows for the creation of longer chains of LDBuffers in the 
LogicalDiskDistributor, which in turn causes write sizes 
to increase. The increase in write size is not linear in 
the amount of RAM added; this is likely because we are 
approaching the point past which larger writes will not 
dramatically improve write throughput. 


5.4 TritonSort Scalability 


Figure 11 shows TritonSort’s total throughput when sort- 
ing 1 TB per node as the number of nodes increases from 
2 to 48. Phase two exhibits practically linear scaling, 
which is expected since each node performs phase two 
in isolation. Phase one’s scalability is also nearly linear; 
the slight degradation in its performance at large scales 
is likely due to network variance that becomes more 
pronounced as the number of nodes increases. 


6 Discussion and Future Work 


In this section, we discuss our system and present direc- 
tions for future work. 


6.1 Supporting More General Sorting 


Two assumptions that we make in our design are that tu- 
ples are uniform in size, and that they are uniformly and 
identically distributed across the input files. TritonSort 
can be extended to support non-uniform tuple sizes by 
extending the tuple data structure to keep key and value 
lengths. The most significant modification that this will 
necessitate will be supporting the in-memory sort of keys in 
phase two, which will require modifications to the phase 
two Sorter stage. To support the non-uniform distribution 
of keys across input files, we plan to implement a new 
phase that will operate before TritonSort begins in which 
a random small subset of the input data is scanned, de- 
termining a histogram of the key distribution. Using this 
empirical distribution, we will determine a hash function 
that spreads tuples across nodes as uniformly as possible. 
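
One simple way to turn a key sample into a balanced partitioning is to cut the sorted sample at its quantiles. The Python sketch below illustrates that idea; it is a swapped-in range-partitioning technique rather than the hash function the authors plan to build, and the function names are hypothetical. How the sample is drawn is left to the caller.

```python
import bisect


def choose_split_points(key_sample, num_nodes):
    """Pick num_nodes - 1 boundary keys at the quantiles of the sample."""
    ordered = sorted(key_sample)
    return [ordered[(i * len(ordered)) // num_nodes] for i in range(1, num_nodes)]


def node_for_key(key, split_points):
    """Map a key to the index of the node whose key range contains it."""
    return bisect.bisect_right(split_points, key)


# Usage with a skewed key distribution (squares cluster near zero).
keys = [i * i for i in range(1000)]
splits = choose_split_points(keys[::10], 4)
counts = [0, 0, 0, 0]
for k in keys:
    counts[node_for_key(k, splits)] += 1
print(splits, counts)
```

With a representative sample, each node receives close to the same number of tuples even when the keys themselves are far from uniform.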


6.2 Automated Performance Tuning 


In the current TritonSort prototype, the sizes of individ- 
ual buffers, the number of buffers of each type, and the 
number of workers implementing each stage are deter- 
mined manually. Key to supporting more general hard- 
ware configurations and more general DISC applications 
is the ability to determine these quantities automatically 
and dynamically. This automatic selection will need to 
be performed both statically at design time, and dynam- 
ically during runtime based on observed conditions. A 
stage’s performance on synthetic data in isolation pro- 
vides a good upper-bound on its real performance and 
makes choosing between different implementations eas- 
ier, but any such synthetic analysis does not take runtime 
conditions such as CPU scheduling and cache contention 
into account. Therefore, some manner of online learning 
algorithm will likely be necessary for the system to de- 
termine a good configuration at scale. 


6.3 Incorporating SSDs into TritonSort 


To achieve nearly sequential-speed throughput to the 
disks, writes must be large. However, limited per-node 
memory capacity and high memory cost make it hard 
to allocate more than 25MB of memory to each Writer- 
Buffer. Here, we discuss a possible use of SSDs to pro- 
vide high write speeds with much smaller buffers. 

If we were to add three 80GB SSDs to each machine, 
we could set up a pipeline in which these SSDs are 
divided between the eight Writers, so that each Writer has 
30 GB of SSD space. The LogicalDiskDistributor passes 
data for each logical disk to the Writer stage in small 
chunks, where Writers write them to the SSDs. Assum- 
ing 315 logical disks per Writer, this gives each logical 
disk 95 MB of space on the SSD. Because the SSD can 
handle such a large number of IOPS, there is no penalty 
for small writes as there is with standard hard drives. 
Once 80 MB of data is written to a single logical disk 
on the SSDs, the Writer initiates a sendfilev() system 
call that causes a sequential DMA transfer of that data 
from the SSD to the appropriate intermediate disk. This 
should lower our memory requirements to 24 GB, while 
permitting extremely large writes. This approach relies 
on two features: significant PCI bandwidth to support 
parallel transfers to the SSDs, and an SSD array present 
in the node able to provide high streaming bandwidth to 
the SSDs; we will need such an array to simultaneously 
support over 640 MBps of parallel writes and 640 MBps 
of parallel reads to fully utilize the disks. 
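
The SSD space figures in this proposal follow from simple division, shown below as a Python sketch. The constant names are ours, the inputs come from the text, and decimal units (1 GB = 1000 MB) are assumed.

```python
SSDS_PER_NODE = 3
SSD_CAPACITY_GB = 80
WRITERS = 8
LOGICAL_DISKS_PER_WRITER = 315
STAGING_THRESHOLD_MB = 80   # data accumulated per logical disk before the transfer

ssd_gb_per_writer = SSDS_PER_NODE * SSD_CAPACITY_GB / WRITERS
ssd_mb_per_logical_disk = ssd_gb_per_writer * 1000 / LOGICAL_DISKS_PER_WRITER
spare_mb_per_logical_disk = ssd_mb_per_logical_disk - STAGING_THRESHOLD_MB

print(ssd_gb_per_writer, round(ssd_mb_per_logical_disk, 1),
      round(spare_mb_per_logical_disk, 1))
```

Each logical disk therefore has about 95 MB of SSD space, which leaves roughly 15 MB of slack beyond the 80 MB staged before each large sequential write.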


7 Related Work 


The Datamation sorting benchmark[5] initially measured 
the elapsed time to sort one million records from disk 
to disk. As hardware has improved, the number of 
records has grown to its current level of 100TB. Over 
the years, numerous authors have reported the perfor- 
mance of their sorting systems, and we benefit from their 
insights[18, 15, 21, 6, 17, 16]. We differ from previous 
sort benchmark holders in that we focus on maximizing 
both aggregate throughput and per-node efficiency. 

Achieving per-resource balance in a large-scale data 
processing system is the subject of a large volume of 
previous research dating back at least as far as 1970. 
Among the more well-known guidelines for building 
such systems are the Amdahl/Case rules of thumb for 
building balanced systems [3] and Gray and Putzolu’s 
“five-minute rule” [13] for trading off memory and I/O 
capacity. These guidelines have been re-evaluated and 
refreshed as hardware capabilities have increased. 

NOWSort[6] was the first of the aforementioned sort- 
ing systems to run on a shared-nothing cluster. NOWSort 
employs a two-phase pipeline that generates multiple 
sorted runs in the first phase and merges them together in 
the second phase, a technique shared by DEMSort[18]. 
An evaluation of NOWSort done in 1998[7] found that 
its performance was limited by I/O bus bandwidth and 
poor instruction locality. Modern PCI buses and multi- 
core processors have largely eliminated these concerns; 
in practice, TritonSort is bottlenecked by disk bandwidth. 

TritonSort’s staged, pipelined dataflow architecture is 
inspired in part by SEDA[20], a staged, event-driven 
software architecture that decouples worker stages by in- 
terposing queues between them. Other DISC systems 
such as Dryad [14] export a similar model, although 
Dryad has fault-tolerance and data redundancy 
capabilities that TritonSort does not currently implement. 

We are further informed by lessons learned from par- 
allel database systems. Gamma[10] was one of the first 
parallel database systems to be deployed on a shared- 
nothing cluster. To maximize throughput, Gamma em- 
ploys horizontal partitioning to allow separable queries 
to be performed across many nodes in parallel, an ap- 
proach that is similar in many respects to our use of log- 
ical disks. TritonSort’s Sender-Receiver pair is similar 
to the exchange operator first introduced by Volcano[12] 
in that it abstracts data partitioning, flow control, paral- 
lelism and data distribution from the rest of the system. 


8 Conclusions 


In this work, we describe the hardware and software 
architecture necessary to build TritonSort, a highly ef- 
ficient, pipelined, stage-driven sorting system designed 
to sort tens to hundreds of TB of data. Through care- 
ful management of system resources to ensure cross- 
resource balance, we are able to sort tens of GB of data 
per node per minute, resulting in 916 GB/min across only 
52 nodes. We believe the work holds a number of lessons 
for balanced system design and for scale-out architec- 
tures in general and will help inform the construction of 
more balanced data processing systems that will bridge 
the gap between scalability and per-node efficiency. 


Abstract 


The causes of performance changes in a distributed 
system often elude even its developers. This paper de- 
velops a new technique for gaining insight into such 
changes: comparing request flows from two executions 
(e.g., of two system versions or time periods). Build- 
ing on end-to-end request-flow tracing within and across 
components, algorithms are described for identifying and 
ranking changes in the flow and/or timing of request pro- 
cessing. The implementation of these algorithms in a 
tool called Spectroscope is evaluated. Six case studies 
are presented of using Spectroscope to diagnose perfor- 
mance changes in a distributed storage service caused by 
code changes, configuration modifications, and compo- 
nent degradations, demonstrating the value and efficacy 
of comparing request flows. Preliminary experiences 
of using Spectroscope to diagnose performance changes 
within select Google services are also presented. 


1 Introduction 


Diagnosing performance problems in distributed systems 
is difficult. Such problems may have many sources and 
may be contained in any one or more of the component 
processes or, more insidiously, may emerge from the in- 
teractions among them [21]. A suite of debugging tools 
is needed to help in identifying and understanding the 
root causes of the diverse types of performance prob- 
lems that can arise. In contrast to single-process appli- 
cations, for which diverse performance debugging tools 
exist (e.g., DTrace [6], gprof [14], and GDB [12]), too 
few techniques have been developed for guiding diagno- 
sis of distributed system performance. 

Recent research has developed promising new tech- 
niques that can help populate the suite. Many build on 
low-overhead end-to-end tracing (e.g., [4, 7, 9, 11, 31, 
34]), which captures the flow (i.e., path and timing) of 
individual requests within and across the components of 
a distributed system. For example, with such rich infor- 
mation about a system’s operation, researchers have de- 
veloped new techniques for detecting anomalous request 
flows [4], spotting large-scale departures from perfor- 
mance models [33], and comparing observed behaviour 
to manually-constructed expectations [26]. 

This paper develops a new technique for the suite: 
comparing request flows between two executions to iden- 
tify why performance has changed between them. Such 
comparison allows one execution to serve as a model 
of acceptable performance; highlighting key differences 
from this model and understanding their performance 
costs allows for easier diagnosis than when only a single 
execution is used. Though obtaining an execution of ac- 
ceptable performance may not be possible in all cases— 
e.g., when a developer wants to understand why perfor- 
mance has always been poor—there are many cases for 
which request-flow comparison is useful. For example, 
it can help diagnose performance changes resulting from 
modifications made during software development (e.g., 
during regular regression testing) or from upgrades to 
components of a deployed system. Also, it can help 
when diagnosing changes over time in a deployed sys- 
tem, which may result from component degradations, re- 
source leakage, or workload changes. 

Our analysis of bug tracking data for a distributed stor- 
age service indicates that more than half of the reported 
performance problems would benefit from guidance pro- 
vided by comparing request flows. Talks with Google 
engineers [3] and experiences using request-flow com- 
parison to diagnose Google services affirm its utility. 

The utility of comparing request flows relies on the 
observation that performance changes often manifest as 
changes in how requests are serviced. When comparing 
two executions, which we refer to as the non-problem 
period (before the change) and the problem period (after 
the change), there will usually be some changes in the 
observed request flows. We refer to new request flows 
in the problem period as mutations and to the request 
flows corresponding to how they were serviced in the 
non-problem period as precursors. Identifying mutations 
and comparing them to their precursors helps localize 
sources of change and gives insight into their effects. 

This paper describes algorithms for effectively com- 
paring request flows across periods, including for iden- 
tifying mutations, ranking them based on their contribu- 
tion to the overall performance change, identifying their 
most likely precursors, highlighting the most prominent 
divergences, and identifying low-level parameter differ- 
ences that most strongly correlate to each. 

We categorize mutations into two types: Response- 
time mutations correspond to requests that have in- 
creased only in cost between the periods; their precursors 
are requests that exhibit the same structure, but whose re- 
sponse time is different. Structural mutations correspond 
to requests that take different paths through the system in 
the problem period. Identifying their precursors requires 
analysis of all request flows with differing frequencies in 
the two periods. 
Figure 1 illustrates a (mocked up) example of two mutations 
and their precursors. Ranking and highlighting 
divergences involves using statistical tests and compari- 
son of mutations and associated precursors. 

We have implemented request-flow comparison in a 
toolset called Spectroscope and used it to diagnose per- 
formance problems observed in Ursa Minor [1], a dis- 
tributed storage service. By describing five real problems 
and one synthetic one, we illustrate the utility of compar- 
ing request flows and show that our algorithms enable 
effective use of this technique. To understand challenges 
associated with scaling request-flow comparison to very 
large distributed systems, this paper also describes pre- 
liminary experiences using it to diagnose performance 
changes within distributed services at Google. 


2 End-to-end request-flow tracing 


Request-flow comparison builds on end-to-end tracing, 
an invaluable information source that captures a dis- 
tributed system’s performance and control flow in detail. 
Such tracing works by capturing activity records at each 
of various trace points within the distributed system’s 
software, with each record identifying the specific trace- 
point name, the current time, and other contextual infor- 
mation. Most implementations associate activity records 
with individual requests by propagating a per-request 
identifier, which is stored within the record. Activity 
records can be stitched together, either offline or online, 
to yield request-flow graphs, which show the control flow 
of individual requests. Several efforts, including Mag- 
pie [4], Whodunit [7], Pinpoint [9], X-Trace [10, 11], 
Google’s Dapper [31], and Stardust [34] have indepen- 
dently implemented such tracing and shown that it can be 
used continuously with low overhead, especially when 
request sampling is supported [10, 28, 31]. For example, 
Stardust [34], Ursa Minor’s end-to-end tracing mecha- 
nism, adds 1% or less overhead when used with key 
benchmarks, such as SpecSFS [30]. 
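
As a rough illustration of how activity records might be stitched together, the sketch below groups records by their per-request identifier and orders them by time. It is a minimal sketch under strong assumptions: each request is purely sequential, timestamps are comparable across machines, and the record layout `(request_id, trace_point, timestamp)` and the function name `stitch` are hypothetical rather than taken from any of the cited systems.

```python
from collections import defaultdict

def stitch(records):
    """Turn flat activity records into per-request edge lists.

    records: iterable of (request_id, trace_point, timestamp) tuples.
    Returns {request_id: [(from_point, to_point, latency), ...]}.
    """
    by_request = defaultdict(list)
    for request_id, trace_point, timestamp in records:
        by_request[request_id].append((timestamp, trace_point))

    flows = {}
    for request_id, events in by_request.items():
        events.sort()  # order each request's records by timestamp
        flows[request_id] = [
            (a[1], b[1], b[0] - a[0])  # edge labeled with elapsed time
            for a, b in zip(events, events[1:])
        ]
    return flows
```

Handling concurrency would additionally require the explicit synchronization information discussed in the next paragraph.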

End-to-end tracing implementations differ in two key 
respects: whether instrumentation is added automatically 
or manually and whether the request flows can disam- 
biguate sequential and parallel activity. With regard to 
the latter, Magpie [4] and recent versions of both Star- 
dust [34] and X-Trace [10] explicitly account for concur- 
rency by embedding information about thread synchro- 
nization in their traces (see Figure 2). These implemen- 
tations are a natural fit for request-flow comparison, as 
they can disambiguate true structural differences from 
false ones caused by alternate interleavings of concurrent 
activity. Whodunit [7], Pinpoint [9], and Dapper [31] do 
not account for parallelism. 

End-to-end tracing in distributed systems is past the 
research stage. For example, it is used in production 
Google datacenters [31] and in some production three- 
Figure 1: Example output from comparing request flows. 
The two mutations shown are ranked by their effect on the 
change in performance. The item ranked first is a structural 
mutation and the item ranked second is a response-time muta- 
tion. Due to space constraints, mocked-up graphs are shown in 
which nodes represent the type of component accessed. 


tier systems [4]. Research continues, however, on how to 
best exploit the information provided by such tracing. 


3 Behavioural changes vs. anomalies 


Our technique of comparing request flows between two 
periods identifies distribution changes in request-flow 
behaviour and ranks them according to their contribu- 
tion to the observed performance difference. Conversely, 
anomaly detection techniques, as implemented by Mag- 
pie [4] and Pinpoint [9], mine a single period’s request 
flows to identify rare ones that differ greatly from oth- 
ers. In contrast to request-flow comparison, which at- 
tempts to identify the most important differences be- 
tween two sets, anomaly detection attempts to identify 
rare elements within a single set. 

Request-flow comparison and anomaly detection serve 
distinct purposes, yet both are useful. For example, per- 
formance problems caused by changes in the compo- 
nents used (e.g., see Section 8.2), or by common requests 
whose response times have increased slightly, can be eas- 
ily diagnosed by comparing request flows, whereas many 
anomaly detection techniques will be unable to provide 
guidance. In the former case, guidance will be diffi- 
cult because the changed behaviour is common during 
the problem period; in the latter, because the per-request 
change is not extreme enough. 


4 Spectroscope 


To illustrate the utility of comparing request flows, this 
technique was implemented in a tool called Spectroscope 
and used to diagnose performance problems seen in Ursa 
Minor [1] and in certain Google services. This section 
provides an overview of Spectroscope, and the next de- 
scribes its algorithms. Section 4.1 describes how cate- 
Figure 2: Example request-flow graph. The graph shows 
a striped READ in the Ursa Minor distributed storage system. 
Nodes represent trace points and edges are labeled with the 
time between successive events. Parallel substructures show 
concurrent threads of activity. Node labels are constructed by 
concatenating the machine name (e.g., e10), component name 
(e.g., NFS3), trace-point name (e.g., READ_CALL_TYPE), and an 
optional semantic label (e.g., NFSCACHE_READ_MISS). Due to 
space constraints, trace points executed on other components 
as a result of the NFS server’s RPC calls are not shown. 





gories, the basic building block on which Spectroscope 
operates, are constructed. Section 4.2 describes Spectro- 
scope’s support for comparing request flows. 


4.1 Categorizing request flows 


Even small distributed systems can service hundreds to 
thousands of requests per second, so comparing all of 
them individually is not feasible. Instead, exploiting 
a general expectation that requests that take the same 
path should incur similar costs, Spectroscope groups 
identically-structured requests into unique categories 
and uses them as the basic unit for comparing request 
flows. For example, requests whose structures are identical 
because they hit in an NFS server’s data and metadata 
cache will be grouped into the same category, whereas 
requests that miss in both will be grouped in a differ- 
ent one. Two requests are deemed structurally identical 
if their string representations, as determined by a depth- 
first traversal, are identical. For requests with parallel 
substructures, Spectroscope computes all possible string 
representations when determining the category in which 
to bin them. The exponential cost is mitigated by im- 
posing an order on parallel substructures (i.e., by always 
traversing them in alphabetical order, as determined by 
their root node names) and by the fact that parallelism is 
limited in most request flows we have observed. 
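
The canonical-ordering idea can be sketched as follows. This is an illustration only, not Spectroscope's code: it assumes a request flow is modeled as a tree of hypothetical `(label, children)` tuples in which a node with several children represents a fork into parallel substructures, and it breaks ties between equally named roots by their serialized subtrees.

```python
from collections import defaultdict

def canonical(node):
    """Serialize a request-flow tree depth-first into a string.

    Parallel children are visited in alphabetical order of their root
    labels, so two requests that differ only in the interleaving of
    concurrent branches produce the same string.
    """
    label, children = node
    ordered = sorted((child[0], canonical(child)) for child in children)
    return label + "(" + ",".join(text for _, text in ordered) + ")"

def categorize(requests):
    """Group structurally identical requests into categories."""
    categories = defaultdict(list)
    for request in requests:
        categories[canonical(request)].append(request)
    return categories
```

With a fixed traversal order, each request needs only one string, which avoids enumerating every possible ordering of its parallel branches.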

For each category, Spectroscope identifies aggregate 
statistics, including request count, average response 
time, and variance. To identify where time is spent, it 
also computes average edge latencies and correspond- 
ing variances. Spectroscope displays categories in ei- 
ther a graph view, with statistical information overlaid, 
or within train-schedule visualizations [37] (also known 
as swim lanes), which more directly show the constituent 
requests’ pattern of activity. 
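
A minimal sketch of the per-category statistics described above is given here. The request layout (a dict with a `response_time` and an `edges` mapping) is hypothetical, and population variance is used only because the text does not say which variance estimator is computed.

```python
from statistics import mean, pvariance

def summarize(values):
    """Count, mean, and (population) variance of a list of numbers."""
    return {"count": len(values), "mean": mean(values),
            "variance": pvariance(values)}

def category_stats(requests):
    """Aggregate response times and edge latencies for one category.

    requests: list of dicts with keys
      'response_time' -> float
      'edges'         -> {edge: latency}
    """
    edge_latencies = {}
    for request in requests:
        for edge, latency in request["edges"].items():
            edge_latencies.setdefault(edge, []).append(latency)
    return {
        "response_time": summarize([r["response_time"] for r in requests]),
        "edges": {edge: summarize(v) for edge, v in edge_latencies.items()},
    }
```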

Spectroscope uses selection criteria to limit the num- 
ber of categories developers must examine. For exam- 
ple, when comparing request flows, statistical tests and 
a ranking scheme are used. The number of categories 
could be further reduced by using unsupervised clus- 
tering algorithms, such as those used in Magpie [4], to 
bin similar but not necessarily identical requests into the 
same category. Initial versions of Spectroscope used 
off-the-shelf clustering algorithms [29], but we found 
the groups they created too coarse-grained and unpre- 
dictable. Often, they would group mutations and pre- 
cursors within the same category, masking their exis- 
tence. For clustering algorithms to be useful, improve- 
ments such as distance metrics that better align with de- 
velopers’ notions of request similarity are needed. With- 
out them, use of clustering algorithms will result in cate- 
gories composed of seemingly dissimilar requests. 


4.2 Comparing request flows 


Performance changes can result from a variety of factors, 
such as internal changes to the system that result in per- 
formance regressions, unintended side effects of changes 
to configuration files, or environmental issues. Spectro- 
scope helps diagnose these problems by comparing re- 
quest flows and identifying the key resulting mutations. 
Figure 3 shows Spectroscope’s workflow. 

When comparing request flows, Spectroscope takes as 
input request-flow graphs from two periods of activity, 
which we refer to as a non-problem period and a prob- 
lem period. It creates categories composed of requests 
from both periods and uses statistical tests and heuristics 
to identify which contain structural mutations, response- 
time mutations, or precursors. Categories containing mu- 
tations are presented to the developer in a list ranked by 
expected contribution to the performance change. Note 
that the periods do not need to be aligned exactly with 
the performance change (e.g., at Google we often chose 
day-long periods based on historic average latencies). 

Visualizations of categories that contain mutations 
are similar to those described previously, except per- 
period statistical information is shown. The root cause 
of response-time mutations is localized by showing the 
edges responsible for the mutation in red. The root cause 
of structural mutations is localized by providing a ranked 
list of the candidate precursors, so that the developer can 
determine how they differ. Figure 1 shows an example. 

Spectroscope provides further insight into perfor- 
Figure 3: Spectroscope’s workflow for comparing request 
flows. First, Spectroscope groups requests from both periods 
into categories. Second, it identifies which categories contain 
mutations or precursors. Third, it ranks mutation categories 
according to their expected contribution to the performance 
change. Developers are presented this ranked list. Visualiza- 
tions of mutations and their precursors can be shown. Also, 
low-level differences can be identified for them. 


mance changes by identifying the low-level parameters 
(e.g., client parameters or function call parameters) that 
best differentiate a chosen mutation and its precursors. 
For example, in Ursa Minor, one performance slow- 
down, which manifested as many structural mutations, 
was caused by a change in a parameter sent by the client. 
For problems like this, highlighting the specific low-level 
differences can immediately identify the root cause. 
Section 5 describes Spectroscope’s algorithms and 
heuristics for identifying mutations, their corresponding 
precursors, their rank based on their relative influence 
on the overall performance change, and their most rele- 
vant low-level parameter differences. It also describes 
how these methods overcome key challenges—for ex- 
ample, differentiating true mutations from natural vari- 
ance in request structure and timings. Identification of 
response-time mutations and ranking rely on the expecta- 
tion (reasonable for many distributed systems, including 
distributed storage) that requests that take the same path 
through a distributed system will exhibit similar response 
times and edge latencies. Section 7 describes how high 
variance in this axis affects Spectroscope’s results. 


5 Algorithms for comparing request flows 


This section describes the key heuristics and algorithms 
used when comparing request flows. In creating them, 
we favoured simplicity and those that regulate false 
positives (perhaps the worst failure mode due to developer 
effort wasted) whenever possible. 


5.1 Identifying response-time mutations 


When comparing two periods, there will always be some 
natural differences in timings. Spectroscope uses the 
Kolmogorov-Smirnov two-sample, non-parametric hy- 
pothesis test [20] to differentiate natural variance from 
true changes in distribution or behaviour. Statistical hy- 
pothesis tests take as input two distributions and output 
a p-value, which represents uncertainty in the claim that 
the null hypothesis, that both distributions are the same, 
is false. Expensive false positives are limited to a preset 
rate (almost always 5%) by rejecting the null hypothe- 
sis only when the p-value is lower than this value. The 
p-value increases with variance and decreases with the 
number of samples. A non-parametric test, which does 
not require knowledge of the underlying distribution, is 
used because we have observed that response times are 
not governed by well-known distributions. 

The Kolmogorov-Smirnov test is used as follows. For 
each category, the distributions of response times for 
the non-problem period and the problem period are ex- 
tracted and input into the hypothesis test. The category 
is marked as containing response-time mutations if the 
test rejects the null hypothesis. By default, categories 
that contain too few requests to run the test accurately 
are not marked as containing mutations. To identify the 
components or interactions responsible for the mutation, 
Spectroscope extracts the critical path (i.e., the path of 
the request on which response time depends) and runs 
the same hypothesis test on the edge latency distribu- 
tions. Edges for which the null hypothesis is rejected 
are marked in red in the final output visualization. 
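
The per-category test can be sketched in a few lines. The code below is an illustration, not Spectroscope's implementation: it computes the two-sample Kolmogorov-Smirnov statistic directly and uses the standard asymptotic series for the p-value, which is only approximate for small samples (a library routine such as SciPy's `ks_2samp` would normally be preferred). The function names and the `min_samples` cutoff are hypothetical.

```python
import math
from bisect import bisect_right

def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs; p is the
    asymptotic p-value for the null hypothesis that both samples come
    from the same distribution.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
            for x in a + b)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:  # the series converges slowly here; p is effectively 1
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

def has_response_time_mutation(np_times, p_times, alpha=0.05, min_samples=5):
    """Flag a category when the null hypothesis is rejected at level alpha.

    Categories with too few requests in either period are not flagged
    (min_samples is an assumed cutoff, not a value from the paper).
    """
    if min(len(np_times), len(p_times)) < min_samples:
        return False
    _, p = ks_two_sample(np_times, p_times)
    return p < alpha
```

The same function could be applied to each critical-path edge's latency samples to decide which edges to highlight.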


5.2 Identifying structural mutations 


To identify structural mutations, Spectroscope assumes a 
similar workload was run in both the non-problem period 
and the problem period. As such, it is reasonable to ex- 
pect that an increase in the number of requests that take 
one path through the distributed system in the problem 
period should correspond to a decrease in the number of 
requests that take other paths. Since non-determinism 
in service order dictates that per-category counts will al- 
ways vary slightly between periods, a threshold is used 
to identify categories that contain structural mutations 
and precursors. Categories that contain SM_THRESHOLD 
more requests from the problem period than from the 
non-problem period are labeled as containing mutations 
and those that contain SM_THRESHOLD fewer are labeled 
as containing precursors. 
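
The thresholding step amounts to a comparison of per-category counts, as in the sketch below. The threshold value, the function name, and the choice to treat a difference of exactly `SM_THRESHOLD` as sufficient are assumptions made for illustration.

```python
SM_THRESHOLD = 50  # hypothetical value; a suitable one depends on the workload

def label_structural(counts, threshold=SM_THRESHOLD):
    """Split categories into structural mutations and precursors.

    counts: {category: (non_problem_count, problem_count)}
    Returns (mutation_categories, precursor_categories).
    """
    mutations, precursors = [], []
    for category, (np_count, p_count) in counts.items():
        delta = p_count - np_count
        if delta >= threshold:
            mutations.append(category)    # many more requests in problem period
        elif -delta >= threshold:
            precursors.append(category)   # many fewer requests in problem period
    return mutations, precursors
```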

Choosing a good threshold for a workload may require 
some experimentation, as it is sensitive to both the num- 
ber of requests issued and the sampling rate. Fortunately, 
it is easy to run Spectroscope multiple times, and it is not 
necessary to get the threshold exactly right—choosing a 
value that is too small will result in more false positives, 
but they will be given a low rank and so will not mislead 
the developer in his diagnosis efforts. 


If per-category distributions of request counts are 
available, a statistical test, instead of a threshold, could 
be used to determine those categories that contain mu- 
tations or precursors. This statistical approach would 
be superior to a threshold-based approach, as it guar- 
antees a set false-positive rate. However, building the 
distributions necessary would require obtaining many 
non-problem and problem-period datasets, so we opted 
for the simpler threshold-based approach instead. Also, 
our experiences at Google indicate that request structure 
within large datacenters may change too quickly for such 
expensive-to-build models to be useful. 


5.3 Mapping mutations to precursors 


Once the total set of categories that contain structural 
mutations and precursors has been identified, Spectro- 
scope must iterate through each structural-mutation cate- 
gory to determine the precursor categories that are likely 
to have donated requests to it. This is accomplished via 
three heuristics, described below. Figure 4 shows how 
they are applied. 


First, the total list of precursor categories is pruned to 
eliminate categories with a different root node than those 
in the structural-mutation category. The root node de- 
scribes the overall type of a request, for example READ, 
WRITE, or READDIR, and requests of different high-level 
types should not be precursor/mutation pairs. 


Second, remaining precursor categories that have de- 
creased in request count less than the increase in re- 
quest count of the structural-mutation category are also 
removed from consideration. This 1:N heuristic reflects 
the common case that one precursor category is likely to 
donate requests to N structural-mutation categories. For 
example, a cache-related problem may result in a portion 
of requests that used to hit in that cache to miss and hit 
in the next-level cache. Extra cache pressure at this next- 
level cache may result in the rest missing in both caches. 
This heuristic can be optionally disabled. 


Third, the remaining precursor categories are ranked 
according to their likelihood of having donated requests, 
as determined by the string-edit distance between them 
and the structural-mutation category. This heuristic re- 
flects the intuition that precursors and structural muta- 
tions are likely to resemble each other in structure. The 
cost of computing the edit distance is O(NM), where N 
and M are the lengths of the string representations of the 
categories being compared. 
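
The three heuristics can be sketched together as shown below. This is a minimal illustration under stated assumptions: each category is a hypothetical dict with keys `root`, `path`, `np`, and `p`; the edit distance is the standard Levenshtein dynamic program; and it is applied to whatever sequence `path` holds (a string or a list of trace-point names).

```python
def edit_distance(s, t):
    """Levenshtein distance, computed in O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def candidate_precursors(mutation, precursors, use_one_to_n=True):
    """Return plausible precursor categories, most similar first."""
    gain = mutation["p"] - mutation["np"]
    kept = []
    for cand in precursors:
        if cand["root"] != mutation["root"]:
            continue  # heuristic 1: request types must match
        if use_one_to_n and (cand["np"] - cand["p"]) < gain:
            continue  # heuristic 2 (1:N): decrease must cover the increase
        kept.append(cand)
    # heuristic 3: rank by structural similarity to the mutation
    return sorted(kept,
                  key=lambda c: edit_distance(c["path"], mutation["path"]))
```

Setting `use_one_to_n=False` corresponds to disabling the optional 1:N heuristic.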


Figure 4: How the precursor categories of a structural- 
mutation category are identified. One structural-mutation 
category and five precursor categories are shown, each with 
their corresponding request counts from the non-problem (NP) 
and problem (P) periods. For this case, the shaded precursor 
categories will be identified as those that could have donated 
requests to the structural-mutation category. The precursor cat- 
egories that contain LOOKUP and READDIR requests cannot 
have donated requests because their constituent requests are not 
READS. The top left-most precursor category contains READS, 
but the 1:N heuristic eliminates it. 


5.4 Ranking 


Ranking of mutations is necessary for two reasons. 
First, the performance problem might have multiple root 
causes, each of which causes its own set of mutations. 
Second, even if there is only one root cause to the prob- 
lem (e.g., a misconfiguration), many mutations will often 
still be observed. For both cases, it is useful to identify 
the mutations that most affect performance in order to fo- 
cus diagnosis effort where it will yield the most benefit. 


Spectroscope ranks categories that contain mutations 
in descending order by their expected contribution to the 
performance change. The contribution for a structural- 
mutation category is calculated as the number of mu- 
tations it contains, which is the difference between its 
problem and non-problem period counts, multiplied by 
the difference in problem period average response time 
between it and its precursor categories. If more than 
one candidate precursor category has been identified, a 
weighted average of their average response times is used; 
weights are based on structural similarity to the mutation. 
The contribution for a response-time-mutation category 
is calculated as the number of mutations it contains, 
which is just the non-problem period count, times the 
change in average response time of that category between 
the periods. If a category contains both response-time 
mutations and structural mutations, it is split into 
two virtual categories and each is ranked separately. 
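
The two contribution formulas can be written out as follows. This sketch follows the wording of the text; the dict keys, the function names, and the representation of precursors as `(average response time, weight)` pairs are hypothetical, and the way similarity weights are derived is left open because the text does not specify it.

```python
def structural_contribution(mutation, precursors):
    """Expected contribution of a structural-mutation category.

    mutation:   dict with 'np', 'p' (request counts) and 'p_avg'
                (problem-period average response time).
    precursors: list of (problem_period_avg_response_time, weight),
                where weight reflects structural similarity.
    """
    num_mutations = mutation["p"] - mutation["np"]
    total_weight = sum(w for _, w in precursors)
    precursor_avg = sum(t * w for t, w in precursors) / total_weight
    return num_mutations * (mutation["p_avg"] - precursor_avg)

def response_time_contribution(category):
    """Expected contribution of a response-time-mutation category."""
    return category["np"] * (category["p_avg"] - category["np_avg"])

def rank(contributions):
    """Order category names by descending contribution."""
    return sorted(contributions, key=contributions.get, reverse=True)
```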


5.5 Identifying low-level differences 


Identifying the differences in low-level parameters be- 
tween a mutation and precursor can often help develop- 
ers further localize the source of the problem. For ex- 
ample, the root cause of a response-time mutation might 
be further localized by identifying that it is caused by a 
component that is sending more data in its RPCs than 
during the non-problem period. 

Spectroscope allows developers to pick a mutation 
category and candidate precursor category for which to 
identify low-level differences. Given these categories, 
Spectroscope induces a regression tree [5] showing the 
low-level parameters that best separate requests in these 
categories. Each path from root to leaf represents an 
independent explanation of why the mutation occurred. 
Since developers may already possess some intuition 
about what differences are important, the process is 
meant to be interactive. If the developer does not like 
the explanations, he can select a new set by removing the 
root parameter from consideration and re-running the al- 
gorithm. 

The regression tree is induced as follows. First, a 
depth-first traversal is used to extract a template describ- 
ing the parts of request structures that are common be- 
tween both categories, up until the first observed differ- 
ence. Portions that are not common are excluded, since 
low-level parameters cannot be compared for them. 

Second, a table in which rows represent requests and 
columns represent parameters is created by iterating 
through each of the categories’ requests and extracting 
parameters from the parts that fall within the template. 
Each row is labeled as belonging to the problem or non- 
problem period. Certain parameter values, such as the 
thread ID and timestamp, must always be ignored, as 
they are not expected to be similar across requests. Fi- 
nally, the table is fed as input to the C4.5 algorithm [25], 
which creates the regression tree. To reduce the runtime, 
only parameters from a randomly sampled subset of re- 
quests are extracted from the database, currently a min- 
imum of 100 and a maximum of 10%. Parameters only 
need to be extracted the first time the algorithm is run; 
subsequent iterations can modify the table directly. 
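
To give a feel for how a root parameter might be chosen from such a table, the sketch below scores each parameter by plain information gain over its observed values. It is a simplified stand-in rather than C4.5 itself, which uses gain ratio, handles continuous attributes, and builds and prunes a full tree; the row layout and the names `best_parameter` and `ignored` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_parameter(rows, labels, ignored=("thread_id", "timestamp")):
    """Pick the parameter whose values best separate the two periods.

    rows:   list of dicts mapping parameter name -> value.
    labels: period label for each row (e.g., "NP" or "P").
    Returns (parameter, information_gain), or (None, 0.0) if no
    parameter separates the rows at all.
    """
    base = entropy(labels)
    params = {name for row in rows for name in row} - set(ignored)
    best, best_gain = None, 0.0
    for param in sorted(params):
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row.get(param)].append(label)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = param, gain
    return best, best_gain
```

The interactive step described above corresponds to adding the returned parameter to `ignored` and calling the function again on the same table.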


5.6 Current limitations 


This section describes current limitations with our tech- 
niques for comparing request flows. 

Diagnosing problems caused by contention: Our 
techniques assume that performance changes are caused 
by changes to the system (code changes, configura- 
tion changes, etc). Though they will identify mutations 
caused by contention, they cannot directly attribute them 
to the responsible process. In some cases our techniques 
can indirectly help: for example, by showing that many 
edges within a component are responsible for a response- 
time mutation, they can help the developer intuit that the 
problem is due to contention with an external process. 

Diagnosing problems when the load differs signifi- 
cantly between periods: In such cases, the load change 
itself may be the root cause. Though our techniques 
will identify response-time and structural changes when 
the load during the problem period is much greater than 
the non-problem period, the developer must determine 
whether they are reasonable degradations. 


6 Experimental apparatus 


Most of the experiments and case studies reported in this 
paper come from using Spectroscope with a distributed 
storage service called Ursa Minor. Section 6.1 describes 
this system. Section 6.2 describes the benchmarks used 
for Ursa Minor’s nightly regression tests, the setting in 
which many of the case studies were observed. 

To understand issues in scaling request-flow compari- 
son to larger systems, we also used Spectroscope to diag- 
nose services within Google. Section 6.3 provides more 
details. The implementation of Spectroscope for Ursa 
Minor was written in Perl and MATLAB. It includes a 
visualization layer built upon Prefuse [16]. The cost of 
calculating edit distances dominates its runtime, so it is 
sensitive to the value of SM_THRESHOLD used. The imple- 
mentation for Google was written in C++; its runtime is 
much lower (on the order of seconds) and its visualiza- 
tion layer uses DOT [15]. 


6.1 Ursa Minor 


Figure 5 illustrates Ursa Minor’s architecture. Like most 
modern scalable distributed storage, Ursa Minor sep- 
arates metadata services from data services, such that 
clients can access data on storage nodes without mov- 
ing it all through metadata servers. An Ursa Minor 
instance (called a “constellation”) consists of poten- 
tially many NFS servers (for unmodified clients), stor- 
age nodes (SNs), metadata servers (MDSs), and end-to- 
end-trace servers. To access data, clients must first send 
a request to a metadata server asking for the appropri- 
ate permissions and locations of the data on the storage 
nodes. Clients can then access the storage nodes directly. 

Ursa Minor has been in active development since 2004 
and comprises about 230,000 lines of code. More than 20 
graduate students and staff have contributed to it over its 
lifetime. More details about its implementation can be 
found in Abd-El-Malek et al. [1]. 

The components of Ursa Minor are usually run on sep- 
arate machines within a datacenter. Though Ursa Minor 
supports an arbitrary number of components, the experi- 
Figure 5: Ursa Minor Architecture. Ursa Minor can be 
deployed in many configurations, with an arbitrary number of 
NFS servers, metadata servers, storage nodes (SNs), and trace 
servers. Here, a simple five-component configuration is shown. 


ments and case studies detailed in this paper use a simple 
five-machine configuration: one NFS server, one meta- 
data server, one trace server, and two storage nodes. One 
storage node stores data, while the other stores metadata. 
Not coincidentally, this is the configuration used in the 
nightly regression tests that uncovered many of the prob- 
lems described in the case studies. 

End-to-end tracing infrastructure via Stardust: 
Ursa Minor’s Stardust tracing infrastructure is much like 
its peer group, discussed in Section 2. Request sampling 
is used to capture trace data for a subset of entire requests 
(10% by default), with a per-request decision made ran- 
domly when the request enters the system. Ursa Minor 
contains approximately 200 trace points, 124 manually 
inserted as well as automatically generated ones for each 
RPC send and receive function. In addition to simple 
trace points, which indicate points reached in the code, 
explicit split and join trace points are used to identify the 
start and end of concurrent threads of activity. Low-level 
parameters are also collected at trace points. 
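
Head-based request sampling of this kind can be sketched as below. The class and method names are hypothetical and do not reflect Stardust's API; the only details taken from the text are the 10% default rate and the fact that the decision is made once, randomly, when the request enters the system.

```python
import random

class Request:
    """A request that carries its sampling decision with it."""

    def __init__(self, request_id, sample_rate=0.10, rng=random):
        self.request_id = request_id
        # Decide once, at entry, whether the whole request is traced.
        self.sampled = rng.random() < sample_rate
        self.records = []

    def trace_point(self, name, timestamp, **params):
        """Record an activity record only for sampled requests."""
        if self.sampled:
            self.records.append((self.request_id, name, timestamp, params))
```

Because the decision is per request rather than per trace point, a sampled request is captured in its entirety.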


6.2 Benchmarks used with Ursa Minor 


Experiments run on Ursa Minor use these benchmarks. 

Linux-build and ursa minor-build: These benchmarks 
consist of two phases: a copy phase, in which the 
source tree is tarred and copied to Ursa Minor and then 
untarred, and a build phase, in which the source files 
are compiled. Linux-build (of 2.6.32 kernel) runs for 
26 minutes. About 145,000 requests are sampled. The 
average graph size and standard deviation is 12 and 40 
nodes. Most graphs are small, but some are very big, 
so the per-category equivalents are larger: 160 and 500 
nodes. Ursa minor-build runs for 10 minutes. About 
16,000 requests are sampled and the average graph size 
and standard deviation is 9 and 28 nodes. The per-category 
equivalents are 96 and 100 nodes. 

Postmark-large: This synthetic benchmark evaluates 
the small file performance of storage systems [19]. 
It utilizes 448 subdirectories, 50,000 transactions, and 
200,000 files and runs for 80 minutes. The average graph 
size and standard deviation is 66 and 65 nodes. The per- 
category equivalents are 190 and 81 nodes. 

SPEC SFS 97 V3.0 (SFS97): This synthetic bench- 
mark is the industry standard for measuring NFS server 
scalability and performance [30]. It applies a period- 
ically increasing load of NFS operations to a storage 
system’s NFS server and measures the average response 
time. It was configured to generate load between 50 and 
350 operations/second in increments of 50 ops/second 
and runs for 90 minutes. The average graph size and 
standard deviation is 30 and 51 nodes. The per-category 
equivalents are 206 and 200 nodes. 

IoZone: This benchmark [23] sequentially writes, re-writes,
reads, and re-reads a 5GB file in 20 minutes. The
average graph size and standard deviation is 6 nodes. The 
per-category equivalents are 61 and 82 nodes. 


6.3 Dapper & Google services


The Google services for which Spectroscope was ap- 
plied were instrumented using Dapper, which automati- 
cally embeds trace points in Google’s RPC framework. 
Like Stardust, Dapper employs request sampling, but 
uses a sampling rate of less than 0.1%. Spectroscope 
was implemented as an extension to Dapper’s aggrega- 
tion pipeline, which groups individual requests into cat- 
egories and was originally written to support Dapper’s 
pre-existing analysis tools. Categories created by the 
aggregation pipeline only show compressed call graphs 
with identical children and siblings merged together. 
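
To illustrate why merging identical siblings loses structural information, the sketch below collapses duplicate sibling subtrees in a call tree. It is an assumption-laden toy, not Dapper's aggregation code: the `(name, children)` tuple representation, the `compress` function, and the example call names are all hypothetical.

```python
def compress(node):
    """Return a copy of the tree with identical sibling subtrees merged.

    A node is a (name, children) tuple, where children is a list of nodes.
    """
    name, children = node
    merged = []
    for child in children:
        compressed_child = compress(child)
        if compressed_child not in merged:  # keep the first of each identical sibling
            merged.append(compressed_child)
    return (name, merged)


if __name__ == "__main__":
    request = ("READ", [("GFS_READ", []), ("GFS_READ", []), ("GFS_WRITE", [])])
    print(compress(request))
```

After compression, a request that issued two identical child calls is indistinguishable from one that issued a single such call, so both would land in the same category.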


7 Dealing with high-variance categories 


For automated diagnosis tools to be useful, it is important 
that distributed systems satisfy certain properties about 
variance. For Spectroscope, categories that exhibit high 
variance in response times and edge latencies do not sat- 
isfy the expectation that “requests that take the same path 
should incur similar costs” and can affect its ability to 
identify mutations accurately. Spectroscope’s ability to 
identify response-time mutations is sensitive to variance, 
whereas for structural mutations only the ranking is af- 
fected. Though categories may exhibit high variance in- 
tentionally (for example, due to a scheduling algorithm 
that minimizes mean response time at the expense of 
variance), many do so unintentionally, as a result of latent 
performance problems. For example, in early versions 
of Ursa Minor, several high-variance categories resulted 
from a poorly written hash table that exhibited slowly in- 
creasing lookup times because of a poor hashing scheme. 

For response-time mutations, both false negatives 
and false positives will increase with the number of 
high-variance categories. False negatives will increase
because high variance will reduce the Kolmogorov-
Smirnov test’s power to differentiate true behaviour 
changes from natural variance. False positives, which are 
much rarer, will increase when it is valid for categories to 
exhibit similar response times within a single period, but 
different response times across different ones. The rest 
of this section concentrates on the false negative case. 
To quantify how well categories meet the same
path/similar costs expectation within a single period,
Figure 6 shows a CDF of the squared coefficient of variation
in response time ($C^2$) for large categories induced
by linux-build, postmark-large, and SFS97 in Ursa
Minor. Figure 7 shows the same $C^2$ CDF for large categories
induced by Bigtable [8] running in three Google
datacenters over a 1-day period. Each Bigtable instance
is shared among the machines in its datacenter and services
several workloads. $C^2$ is a normalized measure of
variance and is defined as $(\sigma/\mu)^2$, where $\sigma$ and $\mu$
are the standard deviation and mean of the response times.
Distributions with $C^2$ less than one exhibit low variance,
whereas those with $C^2$ greater than one exhibit high
variance. Large categories
contain more than 10 requests; Tables 1 and 2 show that
they account for only 15-45% of all categories, but contain
more than 98% of all requests. Categories containing
fewer requests are not included, since their smaller
sample size makes the $C^2$ statistic unreliable for them.
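
As a rough illustration of the statistic above, the following sketch computes the squared coefficient of variation for each category and labels large categories as low or high variance. The function names, the dictionary input format, and the choice of population (rather than sample) variance are assumptions made for this example.

```python
from statistics import mean, pvariance

MIN_REQUESTS = 10  # large categories contain more than 10 requests


def squared_cv(response_times):
    """Return the squared coefficient of variation: variance / mean^2."""
    mu = mean(response_times)
    if mu == 0:
        raise ValueError("squared CV is undefined when the mean is zero")
    return pvariance(response_times, mu) / (mu * mu)


def classify(categories):
    """Map each large category name to 'low' or 'high' variance."""
    labels = {}
    for name, times in categories.items():
        if len(times) <= MIN_REQUESTS:
            continue  # too few samples for the statistic to be reliable
        labels[name] = "low" if squared_cv(times) < 1.0 else "high"
    return labels


if __name__ == "__main__":
    example = {"steady": [5.0] * 12, "bursty": [1] * 10 + [100], "small": [1, 2, 3]}
    print(classify(example))
```

Dividing by the squared mean is what makes the measure comparable across categories whose typical response times differ by orders of magnitude.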

For the benchmarks run on Ursa Minor, at least 88%
of the large categories exhibit low variance. $C^2$ for all
the categories generated by postmark-large is small.
More than 99% of its categories exhibit low variance and
the maximum $C^2$ value observed is 6.88. The results for
linux-build and SFS97 are slightly more heavy-tailed.
For linux-build, 96% of its categories exhibit low variance,
and the maximum $C^2$ value is 394. For SFS97,
88% exhibit $C^2$ less than 1, and the maximum $C^2$ value
is 50.3. Analysis of categories in the large tail of these
benchmarks shows that part of the observed variance is a
result of contention for locks in the metadata server.

The traces collected for Bigtable by Dapper are relatively
sparse: graphs generated for it are often composed
of only a few nodes, with one node showing the
incoming call type (e.g., READ, MUTATE, etc.) and another
showing the call type of the resulting GFS [13] request.
As such, many dissimilar paths cannot be disambiguated
and have been merged together in the observed
categories. Even so, 47-69% of all categories exhibit $C^2$
less than 1. Additional instrumentation, such as trace points
that show the sizes of Bigtable data requests and work
done within GFS, would serve to further disambiguate
unique paths and considerably reduce $C^2$.


8 Ursa Minor case studies


Spectroscope is not designed to replace developers; 
rather it is intended to serve as an important step in the 
workflow they use to diagnose problems. Sometimes
it can help developers identify the root cause immedi-
ately, or at least localize the problem to a specific area 
of the system. In other cases, it can help eliminate the 
distributed system as the root cause by verifying that its 
behaviour has not changed, allowing developers to focus 
their efforts on external factors. 

This section presents diagnoses of six performance 
problems solved by using Spectroscope to compare re- 
quest flows and analyzes its effectiveness in identifying 
the root causes. Most of these problems were previously 
unsolved and diagnosed by the authors without knowl- 
edge of the root cause. One problem was observed before 
Spectroscope was available, so it was re-injected to show 
how effectively it could have been diagnosed. By intro- 
ducing a synthetic spin loop of different delays, we also 
demonstrate Spectroscope’s ability to diagnose changes 
in response time. 


8.1 Methodology 


Three complementary metrics are provided for evaluat- 
ing Spectroscope’s output. 

The percentage of the 10 highest-ranked categories 
that are relevant: This metric measures the quality of 
the rankings of the results. It accounts for the fact that 
developers will naturally investigate the highest-ranked 
categories first, so it is important for them to be relevant. 

The percentage of false-positive categories: This 
metric evaluates the quality of the ranked list by iden- 
tifying the percentage of all results that are not relevant. 

Request coverage: This metric evaluates quality of 
the ranked list by identifying the percentage of requests 
affected by the problem that are identified in it. 
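
The three metrics above can be computed along the following lines. This is a sketch under stated assumptions: the function names and input formats are hypothetical, the ground-truth set of relevant categories is assumed to be supplied by the evaluator, and when fewer than 10 results exist the first metric is taken over however many results there are.

```python
def top10_relevant_pct(ranked, relevant):
    """Percentage of the (up to) 10 highest-ranked categories that are relevant."""
    top = ranked[:10]
    if not top:
        return 0.0
    return 100.0 * sum(c in relevant for c in top) / len(top)


def false_positive_pct(ranked, relevant):
    """Percentage of all results in the ranked list that are not relevant."""
    if not ranked:
        return 0.0
    return 100.0 * sum(c not in relevant for c in ranked) / len(ranked)


def request_coverage_pct(ranked, affected_requests):
    """Percentage of problem-affected requests that fall in ranked categories.

    affected_requests maps each category name to the number of its requests
    that were affected by the problem (hypothetical ground truth).
    """
    total = sum(affected_requests.values())
    if total == 0:
        return 0.0
    ranked_set = set(ranked)
    covered = sum(n for c, n in affected_requests.items() if c in ranked_set)
    return 100.0 * covered / total


if __name__ == "__main__":
    ranked = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
    relevant = {"c1", "c2"}
    print(round(top10_relevant_pct(ranked, relevant)))
    print(round(false_positive_pct(ranked, relevant)))
```

The first metric looks only at the head of the list, whereas the second and third describe the list as a whole, which is why a result set can score well on one and poorly on another.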

Table 3 summarizes Spectroscope’s performance us- 
ing these metrics. Unless otherwise noted, a default 
value of 50 was used for SM_THRESHOLD. We chose this 
value to yield reasonable runtimes (between 15-30 min- 
utes) when diagnosing problems in larger benchmarks, 
such as SFS97 and postmark-large. When necessary, 
it was lowered to further explore the space of possible 
structural mutations. 


8.2 MDS configuration change 


After a particular large code check-in, performance of 
postmark-large decayed significantly, from 46tps to 
28tps. To diagnose this problem, we used Spectro- 
scope to compare request flows between two runs of 
postmark-large, one from before the check-in and one 
from after. The results showed many categories that con- 
tained structural mutations. Comparing them to their 
most-likely precursor categories revealed that the stor- 
age node utilized by the metadata server had changed. 
Before the check-in, the metadata server wrote metadata 
only to its dedicated storage node. After the check-in, it 
issued most writes to the data storage node instead. We
also used Spectroscope to identify the low-level parameter
differences between a few structural-mutation categories
and their corresponding precursor categories. The
regression tree found differences in elements of the data
distribution scheme (e.g., type of fault tolerance used).

Figure 6: CDF of $C^2$ for large categories induced by three
benchmarks run on Ursa Minor. At least 88% of the categories
induced by each benchmark exhibit low variance ($C^2$ < 1).
The results for linux-build and SFS97 are more heavy-tailed
than postmark-large, partly due to extra lock contention in
the metadata server.

Benchmark                  linux-build   postmark-large     SFS97
Categories                         351              716      1602
Large categories (%)              25.3             29.9      14.7
Requests sampled               145,167          131,113   210,669
In large categories (%)           99.7             99.2      98.9

Table 1: Distribution of requests in the categories induced
by three benchmarks run on Ursa Minor. Though many categories
are generated, most contain only a small number of requests.
Large categories, which contain more than 10 requests,
account for between 15-29% of all categories generated, but
contain over 99% of all requests.


We presented this information to the developer of the 
metadata server, who told us the root cause was a change 
in an infrequently-modified configuration file. Along 
with the check-in, he had mistakenly removed a few 
lines that pre-allocated the file used to store metadata and 
specify the data distribution. Without this, Ursa Minor 
used its default distribution scheme and sent all writes to 
the data storage node. The developer was surprised to 
learn that the default distribution scheme differed from 
the one he had chosen in the configuration file. 


Summary: For this real problem, comparing re- 
quest flows helped developers diagnose a performance 
change caused by modifications to the system configuration.
Many distributed systems contain large configuration
files with esoteric parameters (e.g., hadoop-site.xml)
that, if modified, can result in perplexing performance
changes. Spectroscope can provide guidance in such
cases by showing how various configuration options affect
system behaviour.

Figure 7: CDF of $C^2$ for large categories induced by
Bigtable instances in three Google datacenters. Dapper’s
instrumentation of Bigtable is sparse, so many paths cannot be
disambiguated and have been merged together in the observed
categories, resulting in a higher $C^2$ than otherwise expected.
Even so, 47-69% of categories exhibit low variance.

Google datacenter                A        B        C
Categories                      29       24       17
Large categories (%)          32.6     45.2     26.9
Requests sampled             7,088    5,556    2,079
In large categories (%)        oid     98.8     93.1

Table 2: Distribution of requests in the categories induced
by three instances of Bigtable over a 1-day period. Fewer
categories and requests are observed than for Ursa Minor, because
Dapper samples less than 0.1% of all requests. The distribution
of requests within categories is similar to Ursa Minor: a
small number of categories contain most requests.

Quantitative analysis: For the evaluation in Table 3, 
results in the ranked list were deemed relevant if they 
included metadata accesses to the data storage node with 
a most-likely precursor category that included metadata 
accesses to the metadata storage node. 


8.3 Read-modify-writes


This problem was observed and diagnosed before devel- 
opment on Spectroscope began; it was re-injected in Ursa 
Minor to show how Spectroscope could have helped de- 
velopers easily diagnose it. 

A few years ago, performance of IoZone declined 
from 22MB/s to 9MB/s after upgrading the Linux ker- 
nel from 2.4.22 to 2.6.16.11. To debug this problem, 
one of the authors of this paper spent several days manually
mining Stardust traces and eventually discovered the
root cause: the new kernel’s NFS client was no longer
honouring the NFS server’s preferred READ and WRITE
I/O sizes, which were set to 16KB. The smaller I/O sizes
used by the new kernel forced the NFS server to perform
many read-modify-writes (RMWs), which severely
affected performance. To remedy this issue, support for
smaller I/O sizes was added to the NFS server and counters
were added to track the frequency of RMWs.

                                                                        Quality of results
# / Type          Name               Manifestation   Root cause        # of results   Top 10 rel. (%)   FPs (%)   Cov. (%)
8.2 / Real        MDS config.        Structural      Config. change             128               100         2         70
8.3 / Real        RMWs               Structural      Env. change                  3               100         0        100
8.4 / Real        MDS prefetch. 50   Structural      Internal change              7                29        71         93
8.4 / Real        MDS prefetch. 10   Structural      Internal change             16                70        56         96
8.5 / Real        Create behaviour   Structural      Design problem              11                40        64        N/A
8.6 / Synthetic   100µs delay        Response time   Internal change             17                 0       100          0
8.6 / Synthetic   500µs delay        Response time   Internal change            166               100         6         92
8.6 / Synthetic   1ms delay          Response time   Internal change            178               100         7         93
8.7 / Real        Periodic spikes    No change       Env. change                N/A               N/A       N/A        N/A

Table 3: Overview of the Ursa Minor case studies. This table shows information about each of six problems diagnosed using
Spectroscope. For most of the case studies, quantitative metrics that evaluate the quality of Spectroscope’s results are included.

To show how comparing request flows and identifying 
low-level parameter differences could have helped devel- 
opers quickly identify the root cause, Spectroscope was 
used to compare request flows between a run of IoZone 
in which the Linux client’s I/O size was set to 16KB 
and another during which the Linux client’s I/O size was 
set to 4KB. All of the results in the ranked list were 
structural-mutation categories that contained RMWs. 

We next used Spectroscope to identify the low-level 
parameter differences between the highest-ranked result 
and its most-likely precursor category. The output per- 
fectly separated the constituent requests by the count pa- 
rameter, which specifies the amount of data to be read or 
written by the request. Specifically, requests with count 
parameter values less than or equal to 4KB were classi- 
fied as belonging to the problem period. 
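
The kind of separation described above can be illustrated with a single-split search over one parameter. This is a simplified stand-in for a regression tree, not Spectroscope's implementation: the `best_threshold` function, the sample format, and the fixed rule direction (values at or below the threshold belong to the problem period) are assumptions for the example.

```python
def best_threshold(samples):
    """Find the threshold t minimizing errors for the rule `value <= t => problem`.

    samples is a list of (parameter_value, is_problem_period) pairs.
    Returns (threshold, misclassified_count), or None if samples is empty.
    """
    best = None
    for t in sorted({value for value, _ in samples}):
        errors = sum((value <= t) != is_problem for value, is_problem in samples)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


if __name__ == "__main__":
    # Hypothetical count values in bytes; True marks problem-period requests.
    samples = [(1024, True), (2048, True), (4096, True), (8192, False), (16384, False)]
    print(best_threshold(samples))
```

A split with zero misclassified requests corresponds to the perfect separation reported for the count parameter.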

Summary: Diagnosis of this problem demonstrates 
how comparing request flows can help developers iden- 
tify performance problems that arise due to a workload 
change. It also showcases the utility of highlighting rel- 
evant low-level parameter differences. 

Quantitative analysis: For Table 3, results in the 
ranked list were deemed relevant if they contained 
RMWs and their most-likely precursor category did not. 


8.4 MDS prefetching


A few years ago, several developers, including one of 
the authors of this paper, tried to add server-driven metadata
prefetching to Ursa Minor [17]. This feature was intended
to improve performance by prefetching metadata
to clients on every mandatory metadata server access, in 
hopes of minimizing the total number of accesses neces- 
sary. However, when implemented, this feature provided 
no improvement. The developers spent a few weeks (off 
and on) trying to understand the reason for this unex- 
pected result but eventually moved on to other projects 
without an answer. 


To diagnose this problem, we compared two runs of 
linux-build, one with prefetching disabled and another 
with it enabled. linux-build was chosen because it
is more likely to see performance improvements due to 
prefetching than the other workloads. 


When we ran Spectroscope with SM_THRESHOLD set 
to 50, several categories were identified as contain- 
ing mutations. The two highest-ranked results imme- 
diately piqued our interest, as they contained WRITEs 
that exhibited an abnormally large number of lock ac- 
quire/release accesses within the metadata server. All 
of the remaining results contained response-time muta- 
tions from regressions in the metadata prefetching code 
path, which had not been properly maintained. To further 
explore the space of structural mutations, we decreased 
SM_THRESHOLD to 10 and re-ran Spectroscope. This time, 
many more results were identified; most of the highest- 
ranked ones now exhibited an abnormally high number 
of lock accesses and differed only in the exact number. 


Analysis revealed that the additional lock/unlock calls 
reflected extra work performed by requests that accessed 
the metadata server to prefetch metadata to clients. To 
verify this as the root cause, we added instrumentation 
around the prefetching function to see whether it caused 
the database accesses. Altogether, this information pro- 
vided us with the intuition necessary to determine why 
server-driven metadata prefetching did not improve performance:
the extra time spent in the DB calls by metadata
server accesses outweighed the time savings generated
by the increase in client cache hits.

Summary: This problem demonstrates how compar- 
ing request flows can help developers account for un- 
expected performance loss when adding new features. 
In this case, the problem was due to unanticipated con- 
tention several layers of abstraction below the feature ad- 
dition. Note that diagnosis with Spectroscope is interac- 
tive, in this case involving developers iteratively modify- 
ing SM_THRESHOLD to gain additional insight. 

Quantitative analysis: For Table 3, results in the 
ranked list were deemed relevant if they contained at 
least 30 LOCK_ACQUIRE → LOCK_RELEASE edges. Results
for the output when SM_THRESHOLD was set to 10
and 50 are reported. In both cases, response-time muta- 
tions caused by decay of the prefetching code path are 
conservatively considered false positives, since these re- 
gressions were not the focus of this diagnosis effort. 
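
The relevance criterion used for this case study can be sketched as a simple edge count. The representation of a request-flow graph as a list of `(source, destination)` trace-point pairs, and the names `count_edge` and `is_relevant`, are assumptions for this example; only the edge type and the threshold of 30 come from the text.

```python
LOCK_EDGE = ("LOCK_ACQUIRE", "LOCK_RELEASE")
MIN_LOCK_EDGES = 30  # relevance threshold stated in the text


def count_edge(edges, edge=LOCK_EDGE):
    """Count how many times a (source, destination) edge occurs in a graph."""
    return sum(1 for e in edges if e == edge)


def is_relevant(edges, minimum=MIN_LOCK_EDGES):
    """Treat a result as relevant if it has at least `minimum` lock edges."""
    return count_edge(edges) >= minimum


if __name__ == "__main__":
    # Hypothetical graph: one entry edge followed by 30 lock acquire/release edges.
    graph = [("WRITE_START", "LOCK_ACQUIRE")] + [LOCK_EDGE] * 30
    print(count_edge(graph), is_relevant(graph))
```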


8.5 Create behaviour


During development of Ursa Minor, we noticed that 
the distribution of request response times for CREATEs
in postmark-large increased significantly during the
course of the benchmark. To diagnose this performance
degradation, we used Spectroscope to compare request
flows between the first 1,000 CREATEs issued and the
last 1,000. Due to the small number of requests com- 
pared, SM_THRESHOLD was set to 10. 

Spectroscope’s results showed categories that con- 
tained both structural and response-time mutations, with 
the highest-ranked one containing the former. The 
response-time mutations were the expected result of data 
structures in the NFS server and metadata server whose 
performance decreased linearly with load. Analysis of 
the structural mutations, however, revealed two architec- 
tural issues, which accounted for the degradation. 

First, to serve a CREATE, the metadata server executed 
a tight inter-component loop with a storage node. Each 
iteration of the loop required a few milliseconds, greatly 
affecting response times. Second, categories containing 
structural mutations executed this loop more times than 
their precursor categories. This inter-component loop 
can be seen easily if the categories are zoomed out to 
show only component traversals and plotted in a train 
schedule, as in Figure 8. 
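
Zooming out to component traversals, as described above, amounts to collapsing consecutive trace points that belong to the same component and then counting round trips between two components. The sketch below assumes a request is available as an ordered list of `(component, trace_point)` pairs; that format and the function names are hypothetical.

```python
def component_traversals(trace_points):
    """Collapse (component, trace_point) pairs into the sequence of components visited."""
    path = []
    for component, _trace_point in trace_points:
        if not path or path[-1] != component:
            path.append(component)
    return path


def loop_iterations(path, a, b):
    """Count a -> b -> a round trips in a component path."""
    return sum(
        1
        for i in range(len(path) - 2)
        if path[i] == a and path[i + 1] == b and path[i + 2] == a
    )


if __name__ == "__main__":
    # Hypothetical CREATE: NFS server (A), metadata server (B), storage node (C).
    trace = [("A", "recv"), ("B", "start"), ("B", "lookup"), ("C", "write"),
             ("B", "reply"), ("C", "write"), ("B", "reply"), ("A", "done")]
    path = component_traversals(trace)
    print(path, loop_iterations(path, "B", "C"))
```

Comparing the iteration count of a mutation with that of its precursor makes the extra inter-component work visible without reading the full graphs.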

Conversations with the metadata server’s developer 
led us to the root cause: recursive B-Tree page splits 
needed to insert the new item’s metadata. To ameliorate 
this problem, the developer increased the page size and 
changed the scheme used to pick the created item’s key. 

Summary: This problem demonstrates how request- 
flow comparison can be used to diagnose performance 
degradations, in this case due to a long-lived design 
problem. Though simple counters could have shown
that CREATEs were very expensive, they would not
have shown that the root cause was excessive metadata
server/storage node interaction.

Quantitative analysis: For Table 3, results in the 
ranked list were deemed relevant if they contained struc- 
tural mutations and showed more interactions between 
the NFS server and metadata server than their most- 
likely precursor category. Response-time mutations that 
showed expected performance differences due to load are 
considered false positives. Coverage is not reported as it 
is not clear how to define problematic CREATEs.


8.6 Slowdown due to code changes 


This synthetic problem was injected into Ursa Minor to 
show how request-flow comparison can be used to diag- 
nose slowdowns due to feature additions or regressions 
and to assess Spectroscope’s sensitivity to changes in re- 
sponse time. 

Spectroscope was used to compare request flows be- 
tween two runs of SFS97. Problem period runs included 
a spin loop injected into the storage nodes’ WRITE code 
path. Any WRITE request that accessed a storage node in- 
curred this extra delay, which manifested in edges of the 
form * → STORAGE_NODE_RPC_REPLY. Normally, these
edges exhibit a latency of 100µs.

Table 3 shows results from injecting 100µs, 500µs, and
1ms spin loops. Results were deemed relevant if they
contained response-time mutations and correctly identi- 
fied the affected edges as those responsible. For the latter 
two cases, Spectroscope was able to identify the resulting
response-time mutations and localize them to the affected
edges. Of the categories identified, only 6-7% are
false positives and 100% of the 10 highest-ranked ones 
are relevant. The coverage is 92% and 93%. 

Variance in response times and the edge latencies in
which the delay manifests prevent Spectroscope from
properly identifying the affected categories for the 100µs
case. It identifies 11 categories that contain requests that
traverse the affected edges multiple times as containing
response-time mutations, but is unable to assign those
edges as the ones responsible for the slowdown.

Figure 8: Visualization of create behaviour. Two train-schedule
visualizations are shown, the first one a fast early create
during postmark-large and the other a slower create issued
later in the benchmark. Messages are exchanged between
the NFS Server (A), Metadata Server (B), Metadata Storage
Node (C), and Data Storage Node (D). The first phase of the
create procedure is metadata insertion, which is shown to be
responsible for the majority of the delay.


8.7 Periodic spikes 


Ursa minor-build, which is run as part of the nightly 
test suite, periodically shows a spike in the time required 
for its copy phase to complete. For example, from one 
particular night to another, copy time increased from 111 
seconds to 150 seconds, an increase of 35%. We initially 
suspected that the problem was due to an external pro- 
cess that periodically ran on the same machines as Ursa 
Minor’s components. To verify this assumption, we com- 
pared request flows between a run in which the spike was 
observed and another in which it was not. 

Surprisingly, Spectroscope’s output contained only 
one result: GETATTRs, which were issued more frequently
during the problem period, but which had not
increased in average response time. We ruled this result 
out as the cause of the problem, as NFS’s cache coher- 
ence policy suggests that an increase in the frequency 
of GETATTRs is the result of a performance change, 
not its cause. We probed the issue further by reducing 
SM_THRESHOLD to see if the problem was due to requests 
that had changed only a small amount in frequency, but 
greatly in response time, but did not find any such cases. 
Finally, to rule out the improbable case that the prob- 
lem was caused by an increase in variance of response 
times that did not affect the mean, we compared distribu- 
tions of intra-category variance between two periods us- 
ing the Kolmogorov-Smirnov test; the resulting p-value 
was 0.72, so the null hypothesis was not rejected. These 
observations convinced us the problem was not due to 
Ursa Minor or processes running on its machines. 
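
For readers unfamiliar with the two-sample Kolmogorov-Smirnov test used above, the sketch below computes the test statistic (the largest gap between two empirical CDFs) and applies the standard large-sample critical value. It is a simplified illustration, not the procedure used in the paper: it returns a reject/do-not-reject decision at a chosen significance level rather than a p-value, and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def rejects_null(a, b, alpha=0.05):
    """Large-sample decision: reject 'same distribution' if D exceeds the critical value."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


if __name__ == "__main__":
    before = list(range(100))
    after = list(range(100, 200))
    print(ks_statistic(before, after), rejects_null(before, after))
```

A large p-value such as the 0.72 reported above corresponds to a small statistic relative to the critical value, so the null hypothesis that both periods share one distribution is not rejected.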

We next suspected the client machine as the cause of 
the problem and verified this to be the case by plotting a 
timeline of request arrivals and response times as seen by 
the NFS server (Figure 9). The visualization shows that 
during the problem period, response times stay constant 
but the arrival rate of requests decreases. We currently 
suspect the problem to be backup activity initiated from 
the facilities department (i.e., outside of our system).

Summary: This problem demonstrates how compar- 
ing request flows can help diagnose problems that are not 
caused by internal changes. Informing developers that 
nothing within the distributed system has changed frees 
them to focus their efforts on external factors. 


9 Experiences at Google 


This section describes preliminary experiences using 
request-flow comparison, as implemented in Spectro- 
scope, to diagnose performance problems within select 
Google services. Sections 9.1 and 9.2 describe two such 
experiences. Section 9.3 discusses ongoing challenges in 
adapting request-flow comparison to large datacenters.

Figure 9: Timeline of inter-arrival times of requests at the
NFS Server. A 5s sample of requests, where each rectangle
represents the process time of a request, reveals long periods of
inactivity due to lack of requests from the client during spiked
copy times (B) compared to periods of normal activity (A).


9.1 Inter-cluster performance 


A team responsible for an internal service at Google ob- 
served that load tests run on their software in two dif- 
ferent clusters exhibited significantly different perfor- 
mance, though they expected performance to be similar. 
We used Spectroscope to compare request flows be- 
tween the two load test instances. The results showed 
many categories that contained response-time mutations; 
many were caused by latency changes not only within 
the service itself, but also within RPCs and within sev- 
eral dependencies, such as the shared Bigtable instance 
running in the lower-performing cluster. This led us to 
hypothesize that the primary cause of the slowdown was 
a problem in the cluster in which the slower load test 
was run. Later, we found out that the Bigtable instance 
running in the slower cluster was not working properly, 
confirming our hypothesis. This experience is a further 
example of how comparing request flows can help de- 
velopers rule out the distributed system (in this case, a 
specific Google service) as the cause of the problem. 


9.2 Performance change in a large service 


To help identify performance problems, Google keeps 
per-day records of average request latencies for major 
services. Spectroscope was used to compare two day- 
long periods for one such service, which exhibited a sig- 
nificant performance deviation, but only a small differ- 
ence in load, between the periods compared. Though 
many interesting mutations were identified, we were un- 
able to identify the root cause due to our limited knowl- 
edge of the service, highlighting the importance of do- 
main knowledge in interpreting Spectroscope’s results. 


9.3 Ongoing challenges with scale


Challenges remain in scaling request-flow comparison 
techniques to large distributed services, such as those 
within Google. For example, categories generated for 
well-instrumented large-scale distributed services will be 
much larger than those observed for the 5-instance ver- 
sion of Ursa Minor. Additionally, they may yield many 
categories, each populated with too few requests for sta-
tistical rigor. Robust methods are needed to merge cat-
egories and visualize them without losing important in- 
formation about structure, which occurs with Dapper be- 
cause of its graph compression methods. These meth- 
ods affected the quality of Spectroscope’s results by in- 
creasing variance, losing important structural differences 
between requests, and increasing effort needed to under- 
stand individual categories. Our experiences with unsu- 
pervised learning algorithms, such as clustering [4, 29], 
for merging categories indicate they are inadequate. A 
promising alternative is to use semi-supervised methods, 
which would allow the grouping algorithm to learn de- 
velopers’ mental models of which categories should be 
merged. Also, efficient visualization may be possible by 
only showing the portion of a mutation’s structure that 
differs between it and its precursors. 

More generally, request-flow graphs from large ser- 
vices are difficult to understand because such services 
contain many dependencies, most of which are foreign 
to their developers. To help, tools such as Spectroscope 
must strive to identify the semantic meaning of individ- 
ual categories. For example, they could ask developers to 
name graph substructures about which they are knowl-
edgeable and combine them into a meaningful meta-
name when presenting categories. 


10 Related work 


A number of techniques have been developed for di- 
agnosing performance problems in distributed systems. 
Whereas many rely on end-to-end tracing, others attempt 
to infer request flows from existing data sources, such as 
message send/receive events [27] or logs [38]. These lat- 
ter techniques trade accuracy of re-constructed request 
flows for ease of using existing monitoring mechanisms. 
Other techniques rely on black-box metrics and are lim- 
ited to localizing problems to individual machines. 

Magpie [4], Pinpoint [9], WAP5 [27], and Xu et al. [38]
all identify anomalous requests by finding rare ones that 
differ greatly from others. In contrast, request-flow com- 
parison identifies the changes in distribution between two 
periods that most affect performance. Pinpoint also de- 
scribes other ways to use end-to-end traces, including for 
statistical regression testing, but does not describe how to 
use them to compare request flows. 

Google has developed several analysis tools for use 
with Dapper [31]. Most relevant is the Service Inspector, 
which shows graphs of the unique call paths observed to 
a chosen function or component, along with the resulting 
call tree below it, allowing developers to understand the 
contexts in which the chosen item is used. Because the 
item must be chosen beforehand, the Service Inspector is 
not a good fit for problem localization tasks. 

Pip [26] compares developer-provided, component- 
based expectations of structural and timing behaviour to 


actual behaviour observed in end-to-end traces. Theoret- 
ically, Pip can be used to diagnose any type of problem: 
anomalies, correctness problems, etc. But, it relies on de- 
velopers to specify expectations, which is a daunting and 
error-prone task—the developer is faced with balancing 
effort and generality against the specificity needed to ex- 
pose particular problems. In addition, Pip’s component- 
centric expectations, as opposed to request-centric ones, 
complicate problem localization tasks [10]. Nonetheless, 
in many ways, comparing request flows between executions
is akin to Pip, with developer-provided expectations
being replaced with the observed non-problem period be- 
haviour. Many of our algorithms, such as for ranking 
mutations and highlighting the differences, could be used 
with Pip-style expectations as well. 

The Stardust tracing infrastructure on which our im- 
plementation builds was originally designed to enable 
performance models to be induced from observed sys- 
tem performance [32, 34]. Building on that initial work, 
IRONmodel [33] developed approaches to detecting (and 
correcting) violations of such models, which can indi- 
cate performance problems. In describing IRONmodel, 
Thereska et al. also proposed that the specific nature of 
how observed behaviour diverges from the model could 
guide diagnoses, but they did not develop techniques for 
doing so or explore the approach in depth. 

A number of black-box diagnosis techniques have 
been devised for systems that do not have the detailed 
end-to-end tracing on which our approach to comparing 
request flows relies. For example, Project 5 [2] infers 
bottlenecks by observing messages passed between com- 
ponents. Comparison of performance metrics exhibited 
by systems that should be doing the same work can also 
identify misbehaving nodes [18, 24]. Such techniques 
can be useful parts of a suite, but are orthogonal to the 
contributions of this paper. 

There are also many single-process diagnosis tools 
that inform creation of techniques for distributed sys- 
tems. For example, Delta analysis [36] compares mul- 
tiple failing and non-failing runs to identify the most sig- 
nificant differences. OptiScope [22] compares the code 
transformations made by different compilers to help de- 
velopers identify important differences that affect perfor- 
mance. DARC [35] automatically profiles system calls to 
identify the greatest sources of latency. Our work builds 
on some concepts from such single-process techniques. 


11 Conclusion 


Comparing request flows, as captured by end-to-end 
traces, is a powerful new technique for diagnosing per- 
formance changes between two time periods or system 
versions. Spectroscope’s algorithms for this compari- 
son allow it to accurately identify and rank mutations 
and identify their precursors, focusing attention on the 
most important differences. Experiences with Spectroscope
confirm its usefulness and efficacy.
Abstract


Network performance problems are notoriously tricky 
to diagnose, and this is magnified when applications 
are often split into multiple tiers of application com- 
ponents spread across thousands of servers in a data 
center. Problems often arise in the communication be- 
tween the tiers, where either the application or the net- 
work (or both!) could be to blame. In this paper, we 
present SNAP, a scalable network-application profiler 
that guides developers in identifying and fixing perfor- 
mance problems. SNAP passively collects TCP statistics 
and socket-call logs with low computation and storage 
overhead, and correlates across shared resources (e.g., 
host, link, switch) and connections to pinpoint the lo- 
cation of the problem (e.g., send buffer mismanage- 
ment, TCP/application conflicts, application-generated 
microbursts, or network congestion). Our one-week de- 
ployment of SNAP in a production data center (with 
over 8,000 servers and over 700 application components) 
has already helped developers uncover 15 major per- 
formance problems in application software, the network 
stack on the server, and the underlying network. 


1 Introduction 


Modern data-center applications, running over networks 
with unusually high bandwidth and low latency, should 
have great communication performance. Yet, these ap- 
plications often experience low throughput and high 
delay between the front-end user-facing servers and 
the back-end servers that perform database, storage, 
and indexing operations. Troubleshooting network 
performance problems is hard. Existing solutions— 
like detailed application-level logs or fine-grain packet 
monitoring—are too expensive to run continuously, and 
still offer too little insight into where performance prob- 
lems lie. Instead, we argue that data centers should per- 
form continuous, lightweight profiling of the end-host 


' Microsoft 


network stack, coupled with algorithms for classifying 
and correlating performance problems. 


1.1 Troubleshooting Network Performance 


The nature of the data-center environment makes detect- 
ing and locating performance problems particularly chal- 
lenging. Applications typically consist of tens to hun- 
dreds of application components, arranged in multiple 
tiers of front-ends and back-ends, and spread across hun- 
dreds to tens of thousands of servers. Application devel- 
opers are continually updating their code to add features 
or fix bugs, so application components evolve indepen- 
dently and are updated while the application remains in 
operation. Human factors also enter into play: most de- 
velopers do not have a deep understanding of how their 
design decisions interact with TCP or the network. There 
is a constant influx of new developers for whom the intricacies
of Nagle’s algorithm, delayed acknowledgments,
and silly window syndrome remain a mystery.¹

As a result, new network performance problems hap- 
pen all the time. Compared to equipment failures that 
are relatively easy to detect, performance problems are 
tricky because they happen sporadically and many dif- 
ferent components could be responsible. The developers 
sometimes blame “the network” for problems they can- 
not diagnose; in turn, the network operators blame the 
developers if the network shows no signs of equipment 
failures or persistent congestion. Often, they are both 
right, and the network stack or some subtle interaction 
between components is actually responsible [2, 3]. For 
example, an application sending small messages can trig- 
ger Nagle’s algorithm in the TCP stack, causing trans- 
mission delays leading to terrible application throughput. 

In the production data center we study, the process of 
actually detecting and locating even a single network per- 


¹ Some applications (like memcached [1]) use UDP and re-implement
reliability, error detection, and flow control; however, these
mechanisms can also introduce performance problems. 
formance problem typically requires tens to hundreds of
hours of the developers’ time. They collect detailed ap- 
plication logs (too heavy-weight to run continuously), 
deploy dedicated packet sniffers (too expensive to run 
ubiquitously), or sample the data (too coarse-grained to 
catch performance problems). They then pore over these 
logs and traces using a combination of manual inspec- 
tion and custom-crafted analysis tools to attempt to track 
down the issue. Often the investigation fails or runs 
out of time, and some performance problems persist for 
months before they are finally caught and corrected. 


1.2 Lightweight, Continuous Profiling 


In this paper, we argue that data centers should continuously
profile network performance and analyze the
data in real time to help pinpoint the source of problems.
Given the complexity of data-center applications,
we cannot hope to fully automate the detection, diagnosis,
and repair of network performance problems. Instead,
our goal is to dramatically reduce the demand for
developer time by automatically identifying performance 
problems and narrowing them down to specific times and 
places (e.g., send buffer, delayed ACK, or network con- 
gestion). Any viable solution must be 


• Lightweight: Running everywhere, all the time, requires
a solution that is very lightweight (in terms of
CPU, storage, and network overhead), so as not to 
degrade application performance. 


• Generic: Given the constantly changing nature of
the applications, our solution must detect problems 
without depending on detailed knowledge of the ap- 
plication or its log formats. 


• Precise: To provide meaningful insights, the solution
must pinpoint the component causing network
performance problems, and tease apart interactions 
between the application and the network. 


Finally, the system should help two very different 
kinds of users: (i) a developer who needs to detect, diagnose,
and fix performance problems in his particular
application and (ii) a data-center operator who needs to
understand performance problems with the underlying 
platform so that he can tune the network stack, change 
server placement, or upgrade network equipment. In this 
paper, we present SNAP (Scalable Network-Application 
Profiler), a tool that enables application developers and 
data-center operators to detect and diagnose these perfor- 
mance problems. SNAP represents an “existence proof” 
that a tool meeting our three requirements can be built, 
deployed in a production data center, and provide valu- 
able information to both kinds of users. 
SNAP capitalizes on the unique properties of modern
data centers: 


• SNAP has full knowledge of the network topology,
the network-stack configuration, and the mapping 
of applications to servers. This allows SNAP to use 
correlation to identify applications with frequent 
problems, as well as congested resources (e.g., hosts 
or links) that affect multiple applications. 


• SNAP can instrument the network stack to observe
the evolution of TCP connections directly,
rather than trying to infer TCP behavior from packet 
traces. In addition, SNAP can collect finer-grain in- 
formation, compared to conventional SNMP statis- 
tics, without resorting to packet monitoring. 


In addition, once the developers fix a problem (or the 
operator tunes the underlying platform), we can verify 
that the change truly did improve network performance. 


1.3 SNAP Research Contributions 


SNAP passively collects TCP statistics and socket-level
logs in real time, then classifies and correlates the data to pinpoint
performance problems. The profiler quickly identifies
the right location (end host, link, or switch) and the
right layer (application, network stack, or network) at
the right time. The major contributions of this paper are:


Efficient, systematic profiling of network-application 
interactions: SNAP provides a simple, efficient way 
to detect performance problems through real-time anal- 
ysis of passively-collected measurements of the network 
stack. We provide a systematic way to identify the com- 
ponent (e.g., sender application, send buffer, network, 
or receiver) responsible for the performance problem. 
SNAP also correlates across connections that belong to 
the same application, or share underlying resources, to 
provide more insight into the sources of problems. 


Performance characterization of a production data 
center: We deployed SNAP in a data center with over 
8,000 servers, and over 700 application components (in- 
cluding map-reduce, storage, database, and search ser- 
vices). We characterize the sources of performance prob- 
lems, which helps data-center operators improve the un- 
derlying platform and better tune the network. 


Case studies of performance bugs detected by SNAP: 
SNAP pinpointed 15 significant and unexpected prob- 
lems in application software, the network stack, and the 
interaction between the two. SNAP saved the developers 
significant effort in locating and fixing these problems, 
leading to large performance improvements. 

Section 2 presents the design and development of 
SNAP. Section 3 describes our data-center environment 
Figure 1: SNAP socket-level monitoring and analysis


and how SNAP was deployed. Section 4 validates SNAP 
against both labeled data (i.e., known performance prob- 
lems) and detailed packet traces. Then, we present an 
evaluation of our one-week deployment of SNAP from 
the viewpoint of both the data-center operator (Section 5) 
and the application developer (Section 6). Section 7 
shows how to reduce the overhead of SNAP through dy- 
namic tuning of the polling rate. Section 8 discusses re- 
lated work and Section 9 concludes the paper. 


2 Design of the SNAP Profiler 


In this section, we describe how SNAP pinpoints performance
problems. Figure 1 shows the main components
of our system. First, we collect TCP-connection
statistics, augmented by socket-level logs of application 
read and write operations, in real time with low overhead. 
Second, we run a TCP classifier that identifies and cate- 
gorizes periods of bad performance for each socket, and 
logs the diagnosis and a time sequence of the collected 
data. Finally, based on the logs, we have a centralized 
correlator that correlates across connections that share a 
common resource or belong to the same application to 
pinpoint the performance problems. 


2.1 Socket-Level Monitoring of TCP 


Data centers host a wide variety of applications that may 
use different communication methods and design pat- 
terns, so our techniques must be quite general in order to 
work across the space. The following three goals guided 
the design of our system, and led us away from using the 
SNMP statistics, packet traces, or application logs. 


(i) Fine-grained profiling: The data should be fine-grained
enough to indicate performance problems for individual
applications on a small timescale (e.g., tens of
milliseconds or seconds). Switches typically only capture
link loads at a one-minute timescale, which is far too
coarse-grained to detect many performance problems.
For example, the TCP incast problem [3], caused by microbursts
of traffic at the timescale of tens of milliseconds,
is not even visible in SNMP data.


Statistic        Definition
CurAppWQueue     # of bytes in the send buffer
MaxAppWQueue     Max # of bytes in send buffer over the entire socket lifetime
#FastRetrans     Total # of fast retransmissions
#Timeout         Total # of timeouts
#SampleRTT       Total # of RTT samples
SumRTT           Sum of RTTs that TCP samples
RwinLimitTime    Cumulated time an application is receiver window limited
CwinLimitTime    Cumulated time an application is congestion window limited
SentBytes        Cumulated # of bytes the socket has sent over the entire lifetime
Cwin             Current congestion window
Rwin             Announced receiver window


Table 1: Key TCP-level statistics for each socket [5]


(ii) Low overhead: Data centers can be huge, with hundreds
of thousands of hosts and tens of thousands of sockets
on each host. Yet, the data collection must not degrade 
application performance. Packet traces are too expen- 
sive to capture in real time, to process at line speed, or 
to store on disk. In addition, capturing packet traces on 
high-speed links (e.g., 1-10 Gbps in data centers) often 
leads to measurement errors including drops, additions, 
and resequencing of packets [4]. Thus it is impossible 
to capture packet traces everywhere, all the time, to catch
new performance problems. 


(iii) Generic across applications: Individual applica- 
tions often generate detailed logs, but these logs differ 
from one application to another. Instead, we focus on 
measurements that do not require application support so 
our tool can work across a variety of applications. 

Through our work on SNAP, we found that the follow- 
ing two kinds of per-socket information can be collected 
cheaply enough to be used in analysis of large-scale data 
center applications, while still providing enough insight 
to diagnose where the performance problems lie (whether
they are from the application software, from network is- 
sues, or from the interaction between the two). 


TCP-level statistics: RFC 4898 [5] defines a mecha- 
nism for exposing the internal counters and variables of a 
TCP state-machine that is implemented in both Linux [6] 
and Windows [7]. We select and collect the statistics 
shown in Table 1 based on our diagnosis experience²,
which together expose the data-transfer performance of 
a socket. There are two types of statistics: (1) instantaneous
snapshots (e.g., Cwin) that show the current value


²There are a few other variables in the TCP stack, such as the time
TCP spends in the SlowStart stage, which are also useful but which we do not
discuss in this paper due to space limits.
Locations    Problems                 App/Net      Detection method
Sender app   Sender app limited       App          Not any other problems
Send buffer  Send buffer limited      App and Net  CurAppWQueue ≈ MaxAppWQueue
Network      Fast retransmission      Net          diff(#FastRetrans) > 0
Network      Timeout                  Net          diff(#Timeout) > 0
Receiver     Delayed ACK              App and Net  diff(SumRTT) > diff(#SampleRTT) * MaxQueuingDelay
Receiver     Receiver window limited  App and Net  diff(RwinLimitTime) > 0


Table 2: Classes of network performance for a socket


of a variable in the TCP stack; and (2) cumulative coun- 
ters (e.g., #FastRetrans) that count the number of events 
(e.g., the number of fast retransmissions) that happened 
over the lifetime of the socket. #SampleRTT and SumRTT
are the cumulative values of the number of packets
TCP sampled and the sum of the RTTs for these sampled 
packets. To calculate the retransmission timeout (RTO), 
TCP randomly samples one packet in each congestion 
window, and measures the time from the transmission of 
a packet to the time TCP receives the ACK for the packet 
as the RTT for this packet. 


These statistics are updated by the TCP stack as indi- 
vidual packets are sent and received, making it too ex- 
pensive to log every change of these values. Instead, 
we periodically poll these statistics. For the cumulative 
counters, we calculate the difference between two polls 
(e.g., diff(#FastRetrans)). For snapshot values, we sam- 
ple with a Poisson interval. According to the PASTA 
property (Poisson Arrivals See Time Averages), the sam- 
ples are a representative view of the state of the system. 
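
The polling scheme described above can be illustrated with a minimal Python sketch. It is not SNAP's implementation: the function names (poisson_poll_times, counter_diffs, fraction_true) and the fixed random seed are assumptions made for illustration. The sketch draws exponentially distributed gaps between polls (which makes the poll times a Poisson process), differences successive readings of a cumulative counter, and estimates the fraction of time a snapshot condition holds.

```python
import random


def poisson_poll_times(mean_interval_s, duration_s, seed=0):
    """Return poll timestamps (seconds) with exponentially distributed gaps.

    Exponential gaps make the polls a Poisson process, which is the
    property PASTA relies on. The seed is fixed only for repeatability.
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(1.0 / mean_interval_s)
        if t >= duration_s:
            return times
        times.append(t)


def counter_diffs(readings):
    """Turn successive readings of a cumulative counter into per-poll deltas,
    e.g. readings of #FastRetrans into diff(#FastRetrans)."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def fraction_true(snapshots):
    """Estimate the fraction of time a condition holds from snapshot samples
    taken at Poisson-distributed instants."""
    if not snapshots:
        return 0.0
    return sum(1 for s in snapshots if s) / len(snapshots)
```

For example, counter_diffs([3, 3, 5, 9]) yields the per-interval increments [0, 2, 4], and a nonzero increment marks an interval in which the event occurred at least once.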


Socket-call logs: Event-tracing systems in Windows [8] 
and Linux [9] record the time and number of bytes 
(ReadBytes and WriteBytes) whenever the socket makes 
a read/write call. Socket-call logs show the applica- 
tions’ data-transfer behavior, such as how many connec- 
tions they initiated, how long they maintain each con- 
nection, and how much data they read/write (as opposed 
to the data that TCP actually transfers, i.e., SentBytes). 
These logs supplement the TCP statistics with applica- 
tion behavior to help developers diagnose problems. The 
socket-level logs are collected in an event-driven fashion, 
providing fine-grained information with low overhead. 
In comparison, the TCP statistics introduce a trade-off 
between accuracy and the polling overhead. For exam- 
ple, if SNAP polls TCP statistics once per second, a short 
burst of packet losses is hard to distinguish from a mod- 
est loss rate throughout the interval. 


In summary, SNAP collects two types of data in the 
following formats: (i) timestamp, 4-tuple (source and
destination address/port), ReadBytes, and WriteBytes;
and (ii) timestamp, 4-tuple, TCP-level logs (Table 1).
SNAP uses TCP-level logs to classify the performance 
problems and pinpoint the location of the problem, and 
then provides both the relevant TCP-level and socket-level
logs for the affected connections for that period of
time. Developers can use these logs to quickly find the 
root cause of performance problems. 


2.2 Classifying Single-Socket Performance 


Although it is difficult to determine the root cause of per- 
formance problems, we can pinpoint the component that 
is limiting performance. We classify performance problems
in terms of the stages of data delivery, as summarized
in the first two columns of Table 2³:

1. Application generates the data: The sender appli- 
cation may not generate the data fast enough, either by 
design or because of bottlenecks elsewhere (e.g., CPU, 
memory, or disk). For example, the sender may write a 
small amount of data, triggering Nagle’s algorithm [10] 
which combines small writes together into larger packets 
for better network utilization, at the expense of delay. 

2. Data are copied from the application buffer 
to the send buffer: Even when the network is not 
congested, a small send buffer can limit throughput by 
stalling application writes. The send buffer must keep 
data until acknowledgments arrive from the receiver, lim- 
iting the buffer space available for writing new data. 

3. TCP sends the data to the network: A congested 
network may drop packets, leading to lower throughput 
or higher delay. The sender can detect packet loss by 
receiving three duplicate ACKs, leading to a fast retrans- 
mission. When packet losses do not trigger a triple du- 
plicate ACK, the sender must wait for a retransmission 
timeout (RTO) to detect loss and retransmit the data. 

4. Receiver receives the data and sends an acknowledgment:
The receiver may not read data, or acknowledge
their arrival, quickly enough. The receiver window
can limit the throughput if the receiver is not reading the
data quickly enough (e.g., because of CPU starvation),
allowing data to fill the receive buffer. A receiver delays 
sending acknowledgments in the hope of piggybacking 
the ACK on data in the reverse direction. The receiver 
acknowledges every other packet and waits up to 200 ms 
before sending an ACK. 


³The table only summarizes major performance problems and can
be extended to cover other problems such as out-of-order packets. 
The TCP statistics provide direct visibility into certain
performance problems like packet loss and receiver-window
limits, where cumulative counts (e.g., #Timeout,
#FastRetrans, and RwinLimitTime) indicate whether
the problem occurred at any time during the polling in- 
terval. Detecting other problems relies on an instanta- 
neous snapshot, such as comparing the current backlog 
of the send buffer (CurAppWQueue) to its maximum size 
(MaxAppWQueue); polling with a Poisson distribution 
allows SNAP to accurately estimate the fraction of time 
a connection is send-buffer limited. Pinpointing other 
latency problems requires some notion of expected de- 
lays. For example, the RTT should not be larger than 
the propagation delay plus the maximum queuing de- 
lay (MaxQueuingDelay) (whose value is measured in ad- 
vance by operators), unless a problem like delayed ACK 
occurs. SNAP incorporates knowledge of the network 
configuration to identify these parameters. 

SNAP detects send-buffer, network, and receiver prob- 
lems using the rules listed in the last column of Table 2, 
where multiple problems may take place for the same 
socket during the same time interval. If any of these 
problems are detected, SNAP logs the diagnosis and all 
the variables in Table 1—as well as WriteBytes from the 
socket-call data—to provide the developers with detailed 
information to track down the problem. In the absence 
of any of the previous problems, we classify the connec- 
tion as sender-application limited during the time inter- 
val, and log only the socket-call data to track application 
behavior. Being sender-application limited should be the 
most common scenario for a connection. 
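
The detection rules in Table 2 can be sketched as a small classifier over two consecutive polls of one socket. This is an illustrative sketch, not SNAP's code. The Poll record, the classify function, and the full_ratio parameter are hypothetical, and full_ratio in particular is an assumed stand-in for the "approximately equal" test between CurAppWQueue and MaxAppWQueue, which the text does not quantify.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    """One poll of the Table 1 statistics for a socket (hypothetical record)."""
    cur_app_wqueue: int     # CurAppWQueue (snapshot)
    max_app_wqueue: int     # MaxAppWQueue
    fast_retrans: int       # #FastRetrans (cumulative)
    timeouts: int           # #Timeout (cumulative)
    sample_rtt: int         # #SampleRTT (cumulative)
    sum_rtt: float          # SumRTT in seconds (cumulative)
    rwin_limit_time: float  # RwinLimitTime in seconds (cumulative)


def classify(prev, cur, max_queuing_delay, full_ratio=0.95):
    """Return the list of problems seen between two polls.

    full_ratio is an assumption: the send buffer counts as "full" when the
    backlog reaches this fraction of its maximum size.
    """
    problems = []
    if cur.max_app_wqueue > 0 and cur.cur_app_wqueue >= full_ratio * cur.max_app_wqueue:
        problems.append("send buffer limited")
    if cur.fast_retrans - prev.fast_retrans > 0:
        problems.append("fast retransmission")
    if cur.timeouts - prev.timeouts > 0:
        problems.append("timeout")
    # Delayed ACK rule: diff(SumRTT) > diff(#SampleRTT) * MaxQueuingDelay
    if cur.sum_rtt - prev.sum_rtt > (cur.sample_rtt - prev.sample_rtt) * max_queuing_delay:
        problems.append("delayed ACK")
    if cur.rwin_limit_time - prev.rwin_limit_time > 0:
        problems.append("receiver window limited")
    # With no other problem detected, the connection is sender-application limited.
    return problems or ["sender app limited"]
```

Because the rules are independent checks, several problems can be reported for the same socket in the same interval, matching the behaviour described above.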


2.3 Correlation Across TCP Connections 


Although SNAP can detect performance problems on in- 
dividual connections in isolation, combining information 
across multiple connections helps pinpoint the location 
of the problem. As such, a central controller analyzes 
the results of the TCP performance classifier, as shown 
earlier in Figure 1. The central controller can associate 
each connection with a particular application and with 
shared resources like a host, links, and switches. 


Pinpointing resource constraints (by correlating con- 
nections that share a host, link, or switch): Topology 
and routing data allow SNAP to identify which connec- 
tions share resources such as a host, link, top-of-rack 
switch, or aggregator switch. SNAP checks if a per- 
formance problem (as identified by the algorithm in Ta- 
ble 2) occurs on many connections traversing the same 
resource at the same time. For example, packet losses 
(i.e., diff(#FastRetrans) > 0 or diff(#Timeout) > 0) on
multiple connections traversing the same link would in- 
dicate a congested link. This would detect congestion 
occurring on a much smaller timescale than SNMP could 


measure. As another example, send-buffer problems for 
many connections on the same host could indicate that 
the machine has insufficient memory or a low default 
configuration of the send-buffer size. 


Pinpointing application problems (by correlating across
connections in the same application): SNAP also re- 
ceives a mapping of each socket (as identified by the 
four-tuple) to an application. SNAP checks if a perfor- 
mance problem occurs on many connections from the 
same application, across different machines and differ- 
ent times. If so, the application software may not interact 
well with the underlying TCP layer. With SNAP, we have 
found several application programs that have severe performance
problems, and we are currently working with developers
to address them, as discussed in Section 6.

The two kinds of correlation analysis are similar, except
for (i) the set of connections S to compare (i.e., connections
sharing a resource vs. belonging to the same
service) and (ii) the timescale for the comparison, the correlation
interval T (i.e., transient resource-constraining
events taking a few minutes or hours vs. permanent service
code problems that last for days).

We use a simple linear correlation heuristic that works 
well in our setting. Given a set of connections S and a correlation
interval T, the SNAP correlation algorithm outputs
whether these connections have correlated performance
problems, and provides a time sequence of SNAP
logs for operators and developers to diagnose. 


We construct a performance vector
$$P_T(c, t) = \big(\mathrm{time}_k(p_1, c), \ldots, \mathrm{time}_k(p_5, c)\big)_{k = 1..\lceil T/t \rceil},$$
where $t$ is an aggregation time interval in $T$ and $\mathrm{time}_k(p_i, c)$ ($i = 1..5$)
denotes the total time that connection $c$ is having problem
$p_i$ during time period $[(k-1)t, kt]$.⁴ We pick $c_1$
and $c_2$ in $S$, calculate the Pearson correlation coefficient,
and check if the average across all pairs of connections
(Average Correlation Coefficient, ACC) is larger than a
threshold $\alpha$:
$$ACC = \operatorname{avg}_{c_1, c_2 \in S,\; c_1 \neq c_2} \operatorname{cor}\big(P_T(c_1, t), P_T(c_2, t)\big) > \alpha,$$
where
$$\operatorname{cor}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}.$$


If the correlation coefficient is high, SNAP reports that 
the connections in S have a common problem. To
extend this correlation for different classes of problems 
(e.g., one connection’s delayed ACK problem triggers 


⁴$p_i$ ($i = 1..5$) are the problems of send buffer limited, fast retransmission,
timeout, delayed ACK, and receiver window limited, respectively.
We do not include sender application limited because its time
could be determined given the times of the first five problems. 
Characteristic                 Value
#Hosts                         8K
#Applications                  700
Operating systems              Windows Server 2003, 2008 R2
Default send buffer            8 KB
Maximum segment size (MSS)     1460 Bytes
Minimum retrans. timeout       300 ms
Delayed ACK timeout            200 ms
Nagle’s algorithm              mostly off
Slow start restart             off
Receiver window autotuning     off

Table 3: Characteristics in the production data center. 


the sender application limited problem on another con- 
nection), we can extend our solution to use other infer- 
ence techniques [11, 12] or principal component analysis 
(PCA) [13]. 

In practice, we must choose t carefully. With a large
value of t, SNAP only compares the coarse-grained per- 
formance between connections; for example, if t = T, 
we only check if two connections have the same perfor- 
mance problem with the same percentage of time. With a 
small t, SNAP can detect fine-grained performance prob- 
lems (e.g., two connections experiencing packet loss at 
almost the same time), but is susceptible to clock differences
between the two machines and any differences in the
polling rates for the two connections. The aggregation 
interval t should be large enough to mask the differences 
between the clocks and cannot be smaller than the least 
common multiple of the polling intervals of the connec- 
tions. 
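
As a minimal sketch of the correlation heuristic in this section (not SNAP's implementation), the following Python code computes the Pearson coefficient for every pair of connections and compares the average against a threshold. It assumes each connection's performance vector has already been flattened into one list of numbers (five problem times per aggregation interval, concatenated over the intervals). The function names and the choice to treat a constant vector as uncorrelated (coefficient 0) are assumptions of this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant vector carries no correlation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_correlation(vectors):
    """Average Pearson coefficient over all pairs of distinct connections.

    Unordered pairs suffice because the coefficient is symmetric.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def correlated(vectors, alpha):
    """True if the average correlation coefficient exceeds the threshold."""
    return average_correlation(vectors) > alpha
```

With this sketch, three connections whose problem times rise and fall in the same aggregation intervals produce an average coefficient near 1, while two connections whose problems alternate produce a negative coefficient and are not reported as correlated.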


3 Production Data Center Deployment 


We deployed SNAP in a production data center. This sec- 
tion describes the characteristics of the data center and 
the configuration of SNAP, to set the stage for the fol- 
lowing sections. 


3.1 Data Center Environment 


The data center consists of 8K hosts and runs 700 appli- 
cation components, with the configuration summarized 
in Table 3. The hosts run either Windows Server 2008 
R2 or Windows Server 2003. The default send buffer 
size is 8 KB, and the maximum segment size is 1460 Bytes.
The minimum retransmission timeout for packet loss is 
set to 300 ms, and the delayed-acknowledgment timeout 
is 200 ms. These Windows defaults are configured
for Internet traffic with long RTTs.

While the OS enables Nagle’s algorithm (which com- 
bines small writes into larger packets) by default, most 
delay-sensitive applications disable Nagle’s algorithm 
using the TCP_NODELAY socket option.
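
For illustration, an application can disable Nagle's algorithm on a TCP socket by setting the standard TCP_NODELAY option. The sketch below uses Python's socket module, and the helper name make_low_latency_socket is hypothetical. It shows only the socket option itself, not how any application in the studied data center is written.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A nonzero TCP_NODELAY value tells the stack to send small writes
    # immediately instead of coalescing them into larger packets.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s
```

The trade-off is the one described above: small writes leave the host sooner, at the cost of more, smaller packets on the network.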


Most applications in the data center use persistent 
connections to avoid establishing new TCP connections 
whenever they have data to transmit. Slow-start restart 
is disabled to reduce the delay arising when applications 
transfer a large amount of data after an idle period over a 
persistent connection. 


Receiver-window autotuning—a feature in Windows 
Server 2008 that allows TCP to dynamically tune the re- 
ceiver window size to maximize throughput—is disabled 
to avoid bugs in the TCP stack (e.g., [14]). Windows 
Server 2003 does not support this feature. 


3.2 SNAP Configuration 


We ran SNAP continuously for a week in August 2010. 
The polling interval for TCP statistics follows the Pois- 
son distribution with an average inter-arrival time of 500 
ms. We collected the socket-call logs for all the connec- 
tions from and to the servers running SNAP. Over the 
week, we collected less than 1 GB on each host per day 
and the total is just terabytes of logs for the week. This 
is a very small amount of data compared to packet traces 
which take more than 180 GB per host per day on a 1
Gbps link, even if we just keep packet header informa- 
tion. 


To identify the connections sharing the same switch, 
link, and application, we collect the information about 
the topology, routing, and the mapping between sockets 
and applications in the data center. We collect topology 
and routing information from the data center configura- 
tion files. To identify the mapping between the sockets 
and applications, we first run a script at each machine to 
identify the process that created each socket. We then 
map the processes to the application based on the config- 
uration file for the application deployment. 


To correlate performance problems across connections
using the correlation algorithm we proposed in Section 2.3,
we chose two seconds as the aggregation interval
t, which summarizes the time spent on each performance problem
while masking clock differences between machines. To pinpoint
transient resource constraints, which usually last for
minutes or hours, we chose one hour as the correlation
interval T. To pinpoint problems from application code,
which usually last for days, we chose 24 hours as the
correlation interval T. We chose the correlation threshold
α = 0.4.⁵


⁵It is hard to determine the threshold α in practice. Operators can
instead choose the top n shared resources or applications and investigate their
performance problems.
4 SNAP Validation


To validate the design of SNAP in Section 2 and evaluate
whether SNAP can pinpoint performance problems at the right
place and time, we take two approaches. First, we inject a
few known problems in our production data center and check
whether SNAP correctly catches them. Second, to validate
the decision methods that use inference to determine the
performance class in Table 2 rather than observing it from
TCP statistics directly, we compare SNAP results against
packet traces.


4.1 Validation by Injecting Known Bugs 


To validate SNAP, we injected a few known data-center
networking problems and verified whether SNAP correctly
classifies those problems for each connection. Next,
running our correlation algorithm on the SNAP logs of these
labeled problems together with the other logs from the
data center, SNAP correctly pinpointed all the labeled
problems. For brevity, we first discuss two representative
problems in detail and then show how SNAP pinpoints the
problematic host for each of them.


Problems in receive-window autotuning: We first
injected a receiver-window autotuning problem. This
problem happens when a Windows Server 2008 R2 machine
initiates a TCP connection to a Windows Server 2003 machine
with a SYN packet that requests the receiver-window
autotuning feature. Due to a bug in the TCP stack of
Windows Server 2003,⁶ the 2003 server does not parse the
request for the receiver-window autotuning feature
correctly and returns a SYN ACK packet with a wrong format.
As a result, the 2008 server tunes its receiver window to
four bytes, leading to low throughput and long delay.

To inject this problem, we picked ten hosts running
Windows 2008 in the data center and turned on their
receiver-window autotuning feature. Each of the ten hosts
initiated TCP connections to an HTTP server running
Windows 2003 to fetch 20 files of 5 KB each.⁷ It took the
Windows 2003 server more than 5 seconds to transfer each
5 KB file. SNAP correctly reported that all these
connections were receiver-window limited all the time, and
SNAP logs showed that the announced receiver window size
(RWin) was 4 bytes.


TCP incast: TCP incast [3] is a common performance 
problem in data centers. It happens when an aggregator 
distributes a request to a group of workers, and after pro- 
cessing the requests, these workers send the responses 


6This bug was later fixed with a patch, but some machines do not have
the latest patch.

7We connected ten hosts to the same 2003 server to validate whether
SNAP can correlate these connections and pinpoint the server.
Figure 2: PDF of #Machines with different average correlation
coefficient. 


back at almost the same time. These responses together 
overflow the switch on the path and experience signifi- 
cant packet losses. 

We wrote an application that generates a TCP incast
traffic pattern. To limit the effect of our experiment
on the other applications in the production data center,
we picked 36 hosts under the same top-of-rack switch
(TOR) and used one host as the aggregator to send requests
to the remaining 35 hosts, which serve as workers. These
workers respond with 100 KB of data immediately after they
receive the requests. After receiving all the responses,
the aggregator sends another request to the workers. The
aggregator sends 20 requests in total.

SNAP correctly identified that seven of the 35
connections experienced a significant amount of packet
loss causing retransmission timeouts. This is verified
by our application logs, which show that it takes much
longer to get the response through these seven connections
than through the rest of the connections.⁸


Correlation to pinpoint resource constraints for the
two problems: We mixed the SNAP logs of the
receiver-window autotuning problem and TCP incast with
the logs of a one-hour period collected at all other
machines in the data center. Then we ran the SNAP
correlation algorithm across the connections sharing the
same machine.

SNAP correctly identified the Windows Server 2003 
servers that have receiver-window limited problems 
across 5-10 connections with an average correlation co- 
efficient (ACC) of 0.8. SNAP also correctly identified the 
aggregator machine because the ACC across all the con- 
nections that traverse the TOR is 0.45. Both are above 


8In this experiment, SNAP can only tell that the connections have
correlated timeouts. If the same problem happens for different aggre- 
gators running the same application code, we can tell that the appli- 
cation code causes the timeouts. If SNAP reports all the connections 
have simultaneous small writes (identified from socket call logs) and 
correlated timeouts, we can infer that the application code has incast 
problems. 
the threshold α = 0.4, which is chosen based on the
discussion in Section 3.

Our correlation algorithm clearly distinguished the
two injected problems from the performance of connections
on the other machines in the data center. Figure 2
presents the probability density function (PDF) of the
number of machines with different values of ACC. Only
2.7% of the machines have an ACC larger than 0.4. Besides
the two injected problems, the other machines with
ACC > 0.4 may also indicate problems that happened during
our experiment, but we have not verified these problems
yet.


4.2 Validation Against Packet Traces 


We also need to validate the performance-classification
algorithm defined in Table 2. The detection methods for
the performance classes of fast retransmissions, timeouts,
and receiver window limited are always accurate because
these classes are based on phenomena observed directly
(e.g., #Timeouts) from the TCP stack. The accuracy of
identifying send buffer problems is closely related to the
probability of detecting the moments when the send buffer
is full in the Poisson sampling, which is well studied in
[15].

There is a tradeoff between the overhead and accuracy
of identifying delayed ACK. The accuracy of identifying
the delayed ACK and small writes classes is closely
related to the estimation of the RTT. However, we cannot
get per-packet RTT from the TCP stack because logging data
for each packet incurs significant overhead. Instead, we
get the sum of estimated RTTs (SumRTT) and the number of
sampled packets (SampleRTT) from the TCP stack.

We evaluate the accuracy of identifying delayed ACK
in SNAP by comparing SNAP’s results with the packet
trace. We picked two real-world applications from the
production data center for which SNAP detects delayed
ACK problems. One connection serves as an aggregator
distributing requests for a Web application that has
delayed ACK problems for 100% of the packets.⁹
Another belongs to a configuration-file distribution
service for various jobs running in the data center, which
has 75% of the packets on average experiencing delayed
ACK. While running SNAP with various polling rates,
we captured packet traces simultaneously. We then
compared the results of SNAP with the number of
delayed-ACK incidents we identified from packet traces.

To estimate the number of packets that experience de- 
layed ACK, SNAP should find a distribution of RTTs 
for the sampled packets that sum up to SumRTT. Those 


9This application distributes requests whose size is smaller than the
MSS (i.e., one packet), and waits longer than the delayed ACK timeout
of 200 ms before sending out another request. So the receiver has to
keep each packet for 200 ms before sending the ACK to the sender.
Figure 3: SNAP estimation error of identifying delayed ACK
problems. 


packets that experience delayed ACK have an RTT around
DelayedACKTimeout. The rest of the packets are all assumed
to experience the maximum queuing delay. Therefore, we
use the equation (diff(#SumRTT) - diff(#SampleRTT)
* MaxQueuingDelay) / DelayedACKTimeout to count the
number of packets experiencing delayed ACK. We use
MaxQueuingDelay = 10 ms and DelayedACKTimeout =
180 ms. The delayed ACK timeout is set to 200 ms in the
TCP stack, but the TCP timer is only accurate to about
10 ms, so the real DelayedACKTimeout varies around 200 ms.
We therefore use 180 ms to be conservative in the delayed
ACK estimation.
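The counting equation above translates directly into code. The sketch below uses the two constants stated in the text; the function name is hypothetical, and clamping the result to the range [0, number of sampled packets] is an added assumption rather than something the text specifies.

```python
MAX_QUEUING_DELAY_MS = 10      # MaxQueuingDelay from the text
DELAYED_ACK_TIMEOUT_MS = 180   # conservative DelayedACKTimeout from the text


def estimate_delayed_acks(d_sum_rtt_ms, d_sample_rtt):
    """Estimate how many sampled packets experienced delayed ACK.

    d_sum_rtt_ms: diff(#SumRTT) over one polling interval, in ms.
    d_sample_rtt: diff(#SampleRTT), the number of RTT samples in the interval.
    """
    excess_ms = d_sum_rtt_ms - d_sample_rtt * MAX_QUEUING_DELAY_MS
    estimate = excess_ms / DELAYED_ACK_TIMEOUT_MS
    # Assumption: a count cannot be negative or exceed the sample count.
    return min(max(estimate, 0.0), float(d_sample_rtt))
```

For instance, ten samples whose RTTs sum to 480 ms give (480 - 10 * 10) / 180, or roughly 2.1 delayed ACKs.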

Figure 3 shows the estimation error of SNAP’s results,
which is defined by $(d_t - d_s)/d_t$, where $d_s$ is the
percentage of packets that experience delayed ACK reported
by SNAP and $d_t$ is the actual percentage of delayed ACK
we get from the packet trace. For the application that
always has delayed ACK, SNAP’s estimation error is 0.006 on
average. For the application that has 75% of packets
experiencing delayed ACK, the estimation error is within
0.2 for the polling intervals that range from 500 ms to 10
sec.
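As a small worked example of the error metric, the sketch below follows the sign convention the text describes (positive means SNAP underestimates, negative means it overestimates). The function name and the sample percentages are hypothetical.

```python
def estimation_error(d_trace, d_snap):
    """Relative error of SNAP's delayed-ACK percentage against the trace.

    Positive values mean SNAP reported fewer delayed ACKs than the trace
    (underestimation); negative values mean overestimation.
    """
    if d_trace == 0:
        raise ValueError("trace percentage must be non-zero")
    return (d_trace - d_snap) / d_trace
```

With a trace value of 75% and a hypothetical SNAP report of 60%, the error is 0.2, which is the edge of the range reported for that application.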

Figure 3 shows that the estimation error drops from
positive (underestimation) to negative (overestimation)
as the polling interval increases. When the polling
interval is smaller than 200 ms, there is at most one
packet experiencing delayed ACK in one polling interval.
If a few packets take less than MaxQueuingDelay to
transfer, we overestimate the part of SumRTT that is
contributed by these packets, so the rest of the RTT is
less than DelayedACKTimeout and the delayed ACK goes
uncounted. When the polling interval is large, there are
more packets experiencing delayed ACK in the same time
interval. Since we use 180 ms instead of 200 ms to detect
delayed ACK, we underestimate the RTT of those packets
whose delayed ACK takes longer than 180 ms. Nine such
packets would contribute enough extra RTT for SNAP to
assume one more delayed ACK.


USENIX Association 




5 Profiling Data Center Performance 


We deployed SNAP in the production data center to char- 
acterize different classes of performance problems, and 
provided information to the data-center operators about 
problems with the network stack, network congestion or 
the interference between services. We first characterize 
the frequency of each performance problem in the data 
center, and then discuss the key performance problems 
in our data center—packet loss and the TCP send buffer. 


5.1 Frequency of Performance Problems 


Table 4 characterizes the frequency of the network per- 
formance problems (defined in Table 2) in our data cen- 
ter. Not surprisingly, the overall network performance of 
the data center is good. For example, only 0.82% of all 
the connections were receiver limited during their life- 
times. However, there are two key problems that the op- 
erators should address: 


Operators should focus on the small fraction of appli- 
cations suffering from significant performance prob- 
lems. Several connections/applications have severe per- 
formance problems. For example, about 0.11% of the 
connections are receiver-window limited essentially all 
the time. Even though 0.11% sounds like a small num- 
ber, when 8K machines are each running many connec- 
tions, there is almost always some connection or applica- 
tion experiencing bad performance. These performance 
problems at the “tail” of the distribution also constrain 
the total load operators are willing to put in the data cen- 
ter. Operators should look at the SNAP logs of these 
connections and work with the developers to improve the 
performance of these connections so that they can safely 
“ramp up” the utilization of the data center. 


Operators should disable delayed ACK, or significantly
reduce the delayed ACK timeout: About two-thirds of the
connections experienced delayed ACK problems. Nearly 2% of
the connections suffer from delayed ACKs for more than
99.9% of the time. We manually examined the delay-sensitive
services and counted the percentage of connections that
have delayed ACK. Unfortunately, about 136 delay-sensitive
applications have experienced delayed ACKs. Packets that
have delayed ACK experience an unnecessary increase in
latency of 200 ms, which is three orders of magnitude
larger than the propagation delay in the data center and
well exceeds the latency bounds for these applications.
Since delayed ACK causes many problems for data-center
applications, the operators are considering disabling
delayed ACK or significantly reducing the delayed ACK
timeout. The problems of delayed ACK for data center
applications are also observed in [16].
Figure 4: # of fast retransmissions and timeouts over time.

Figure 5: Comparing #FastRetrans and #Timeouts of flows
with different throughput. 


5.2 Packet Loss 


Operators should schedule backup jobs more care- 
fully to avoid triggering network congestion: Figure 4 
shows the number of fast retransmissions and timeouts 
per second over time. The percentage of retransmitted 
bytes increases between 2 am and 4 am. This is because 
most backup applications with large bulk transfers are 
initiated in this time period. 


Operators should reduce the number and effect of 
packet loss (especially timeouts) for low-rate flows: 
SNAP data shows that about 99.8% of the connections 
have low throughput (< 1 MB/sec). Although these 
low-rate flows do not consume much bandwidth and are 
usually not the cause of network congestion, they are 
significantly affected by network congestion. Figure 5
is a scatter plot that shows the ratio of fast
retransmissions to timeouts vs. the connection sending
rate. Each point in the graph represents one polling
interval of one connection. Low-rate flows usually
experience more timeouts than fast retransmissions because
they do not have multiple packets in flight to trigger
triple duplicate ACKs. Timeouts, in turn, limit the throughput
of these flows. In contrast, high-rate flows experience 
Performance limitation | % of conn. with prob. for >X% of time: >0 | >25% | >50% | >75% | >99.9% | #Apps with prob. for >X% of time: >5% | >50%
Sender app limited 557
Send buffer limited
Congestion 6
Receiver window limited
Delayed ACK | 65.71% | 33.20% | 10.10% | 3.21% | 1.82% | 154 | 144
(belong to delay sensitive apps) | 63.52% | 32.82% | 9.71% | 3.01% | 1.61% | 136 | 129


Table 4: Percentage of connections and number of applications that have different TCP performance limitations. 


more fast retransmissions than timeouts and can quickly
recover from packet losses, achieving higher throughput
(> 1 MB/sec).


5.3 Send Buffer and Receiver Window 


Operators should allow the TCP stack to automatically 
tune the send buffer and receiver window sizes, and con- 
sider the following two factors: 


More send buffer problems on machines with more
connections: SNAP reports correlated send buffer problems
on hosts with more than 200 connections. This is
because the larger the send buffer for each connection,
the more memory the machine requires. As a result, the
developers of different applications on the same machine
are cautious in setting the size of the send buffer;
most use the default size of 8 KB, which is far less than
the delay-bandwidth product in the data center and thus is
more likely to become the performance bottleneck.


Mismatch between send buffer and receiver window
size: SNAP logs the announced receiver window size
when the connection is receiver limited. From the log
we see that, for 0.1% of the total time during which the
senders indicate that their connections are bottlenecked
by the receiver window, the receiver actually announced a
64 KB window. This is because the send buffer is larger
than the announced receiver window size, so the sender is
still bottlenecked by the receiver window.


To fix the send-buffer problems in the short term,
SNAP could help developers decide online what send
buffer size they should set.
SNAP logs the congestion window size (CWin),
the amount of data the application expects to send
(WriteBytes), and the announced receiver window
size (RWin) for all the connections. Developers can
use this information to size the send buffer based
on the total resources (e.g., set the send buffer size to

$\mathit{CWin}_{\text{this conn}} \times \mathit{TotalSendBufferMemory} / \sum \mathit{CWin}$).

They can also evaluate the effect of their change using
SNAP. In the long term, operators should have the
TCP stack automatically tune both the send-buffer and
receiver-window sizes for all the connections (e.g., [6]).
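The proportional sizing rule in the example above can be sketched as follows. The function name is hypothetical, integer division is used so that the shares never exceed the memory budget, and the equal split when every congestion window is zero is an assumption added only to keep the sketch total.

```python
def proportional_send_buffers(cwin_by_conn, total_send_buffer_memory):
    """Split a memory budget across connections in proportion to CWin.

    cwin_by_conn: mapping from connection id to its logged CWin in bytes.
    total_send_buffer_memory: bytes available for all send buffers.
    """
    total_cwin = sum(cwin_by_conn.values())
    if total_cwin == 0:
        # Assumption: with no CWin information, fall back to an equal split.
        share = total_send_buffer_memory // max(len(cwin_by_conn), 1)
        return {conn: share for conn in cwin_by_conn}
    return {
        conn: cwin * total_send_buffer_memory // total_cwin
        for conn, cwin in cwin_by_conn.items()
    }
```

For example, with congestion windows of 64 KB, 16 KB and 16 KB and a budget of 960 KB, the three connections would receive 640 KB, 160 KB and 160 KB.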
6 Performance Problems Caught by SNAP


In this section, we show a few examples of performance
problems caught by SNAP. In each example, we
first show how the performance problem is exposed by
SNAP’s analysis of socket and TCP logs into performance
classifications and then by correlation across
connections. Next, we explain how SNAP’s reports help guide
developers to quickly identify the root causes. Finally,
we discuss the developer’s fix or proposed fix to these
problems. For most examples, we spent a few hours or
days discussing with developers to understand how their
programs work and to discover how their programs cause
the problems SNAP detects. It then took several days or
weeks to iterate with developers and operators to find
possible alternative ways to achieve their programming
goals.


6.1 Sending Pattern/Packet Loss Issues 


Spreading application writes over multiple connec- 
tions lowers throughput: When correlating perfor- 
mance problems across connections from the same appli- 
cation, SNAP found one application whose connections 
always experienced more timeouts (diff(#Timeout)) than 
fast retransmission (diff(#FastRetrans)) especially when 
the WriteBytes is small. For example, SNAP reported re- 
peated periods where one connection transferred an av- 
erage of five requests per second with a size of 2 KB - 20 
KB, while experiencing approximately ten timeouts but 
no fast retransmissions. 

The developers were expecting to obtain far more than 
five requests per second from their system, and when this 
report showing small writes and timeouts was shown to 
them the cause became clear. The application sends re- 
quests to a server and waits for responses. Since some 
requests take longer to process than others and devel- 
opers wanted to avoid having to implement request IDs 
while still avoiding head-of-line blocking, they open two 
connections to the server and place new requests on 
whichever connection is unused. 

However, spreading the application writes over two
connections meant that often there was not enough
outstanding data on a connection to cause three duplicate
ACKs and trigger fast retransmission when a packet was
lost. Instead, TCP fell back to its slower timeout
mechanism.
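The dependence of loss recovery on outstanding data can be illustrated with a deliberately simplified model. It assumes a single lost packet, that every later packet in the same burst arrives, and that each such arrival produces exactly one duplicate ACK; the function name is hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed to trigger fast retransmission


def loss_recovery(packets_in_flight, lost_index):
    """Return how a single loss in a burst would be recovered.

    Packets are numbered 0..packets_in_flight-1 and lost_index is the one
    dropped. Only packets sent after the lost one can generate duplicate ACKs.
    """
    if not 0 <= lost_index < packets_in_flight:
        raise ValueError("lost_index must refer to a packet in the burst")
    dup_acks = packets_in_flight - lost_index - 1
    return "fast retransmit" if dup_acks >= DUP_ACK_THRESHOLD else "timeout"
```

Under this model a small request that fits in a handful of packets almost always ends in a timeout when it loses a packet, while the same loss inside a longer burst is repaired by fast retransmission.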

To fix the problem, the application could send all
requests over a single connection, give requests a unique
ID, and use pools of worker threads at each end.¹⁰ This
would improve the chances there is enough data in flight
to trigger fast retransmission when packet loss occurs.


Congestion window failing to prevent sudden bursts: 
SNAP discovered that some connections belonging to an 
application frequently experience packet loss (#FastRe- 
trans and #Timeout are both high, and correlate strongly 
to the application and across time). SNAP’s logs expose 
a time sequence of socket write logs (WriteBytes) and 
TCP statistics (Cwin) showing that before/during the in- 
tervals where packet loss occurs, there is a single large 
socket write call after an idle period. TCP immediately 
sends out the data in one large chunk of packets because 
the congestion window is large, but it experiences packet 
losses. For example, one application makes a socket call
with WriteBytes > 100 MB after an idle period of 3
seconds while the Cwin is 64 KB, and the resulting traffic
burst leads to a bunch of packet losses.

The developers told us they use a persistent connec- 
tion to avoid three-way handshake for each data trans- 
fer. Since “slow start restart” is disabled, the congestion 
window size does not age out and remains constant until 
there is a packet loss. As a result, the congestion window 
no longer indicates the carrying capacity of the network, 
and losses are likely when the application suddenly sends 
a congestion window worth of data. 
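The effect of disabling slow start restart can be sketched with the restart rule from RFC 5681 (Section 4.1): after an idle period longer than the retransmission timeout, the congestion window is reduced to the restart window, min(IW, cwnd). The function name, the 300 ms RTO and the three-segment initial window in the usage note are hypothetical values chosen for illustration; they are not measurements from the data center.

```python
def cwnd_after_idle(cwnd_bytes, idle_s, rto_s, initial_window_bytes,
                    slow_start_restart=True):
    """Congestion window in effect when sending resumes after an idle period."""
    if slow_start_restart and idle_s > rto_s:
        # Restart window: never larger than the initial window.
        return min(initial_window_bytes, cwnd_bytes)
    # With slow start restart disabled, the stale window is kept as-is.
    return cwnd_bytes
```

With a 64 KB window, a 3 second idle period and the hypothetical values above, enabling the rule shrinks the window to three segments, whereas disabling it lets the full 64 KB leave as one burst.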

Interestingly, the developers are opposed to enabling
slow start restart, and they intentionally manipulate the
congestion window in an attempt to reduce latency. For
example, if they send 64 KB of data and the congestion
window is small (e.g., 1 MSS), they need multiple
round-trip times to finish the data transfer. But if they
keep the congestion window large, they can transfer the
data within one RTT. In order to have a large congestion
window, they first make a few small writes when they set
up the persistent connection.

To reduce both the network congestion and delay, we 
need better scheduling of traffic across applications, al- 
lowing delay-sensitive applications to send traffic bursts 
when there is no network congestion, but pacing the traf- 
fic if the network is highly used. The feedback mecha- 
nism proposed in DCTCP [17] could be applied here. 


Delayed ACK slows recovery after a retransmission 
timeout: SNAP found that one application frequently 


10Note that the application should use a single connection because
its requests are relatively small. For those applications that have a large 
amount of data to transfer for each request, they still have to use two 
connections to avoid head of line blocking during the network transfer. 
Figure 6: Delayed ACK after a retransmission timeout.


had two problems (timeout and delayed ACK) at almost 
the same time. As shown in Figure 6, when the fourth 
packet of the transferred data is lost, the TCP sender 
waits for a retransmission timeout (because there are 
not enough following packets to trigger triple-duplicate 
ACKs). However, the congestion window drops to one 
after the retransmission. As a result, TCP can only send 
a single packet, and the receiver waits for a delayed ACK 
timeout before acknowledging the packet. Meanwhile, 
the sender cannot increase its sending window until it 
receives the ACK from the receiver. To avoid this, devel- 
opers are discussing the possibility of dropping the con- 
gestion window down to two packets when a retransmis- 
sion timeout occurs. Disabling delayed ACK is another 
option. 


6.2 Buffer management and Delayed ACK 


Some developers do not manage the application buffer 
and the socket send buffer appropriately, leading to bad 
interactions between buffer management and delayed 
ACK. 


Delayed ACK caused by setting send buffer to zero:
SNAP reports show that some applications have delayed
ACK problems most of the time and that these applications
had set their socket send buffer length to 0.
Investigation found that these applications set the size
of the socket send buffer to zero in the expectation that
it will decrease latency because data is not copied to a
kernel socket buffer, but sent directly from the user space
buffer. However, when the send buffer is zero, the socket
layer locks the application buffer until the data is ACK’d
by the receiver so that the socket can retransmit the data
in case a packet is lost. As a result, additional socket
writes are blocked until the previous one has finished.

Whenever an application writes data that results in an 
odd number of packets being sent, the last packet is not 
ACK’d until the delayed ACK timer expires. This ef- 
fectively blocks the sending application for 200 ms and 
can reduce application throughput to 5 writes per second. 
One team attempted to improve application performance 
by shrinking the size of their messages, but ended up cre- 
ating an odd number of packets and triggering this issue 
— destroying the application’s performance instead of 
helping it. After the developers increased the send buffer 
Figure 7: Performance problem in pipeline communication.


size, throughput returned to normal. 
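The odd-packet effect can be checked with simple arithmetic. The sketch below assumes an MSS of 1460 bytes, a receiver that acknowledges every second segment, and a 200 ms delayed ACK timer; the function names are hypothetical and the model ignores everything else that can trigger an ACK.

```python
import math

MSS_BYTES = 1460             # assumed maximum segment size
DELAYED_ACK_TIMEOUT_S = 0.2  # delayed ACK timer of 200 ms


def packets_for_write(nbytes, mss=MSS_BYTES):
    """Number of segments needed to carry one socket write."""
    return math.ceil(nbytes / mss)


def write_stalls_on_delayed_ack(nbytes, mss=MSS_BYTES):
    """True if the write ends on an odd segment, leaving the last one un-ACK'd."""
    return packets_for_write(nbytes, mss) % 2 == 1


def max_blocking_writes_per_second(timeout_s=DELAYED_ACK_TIMEOUT_S):
    """Upper bound on write rate when every write waits for the timer."""
    return 1.0 / timeout_s
```

A 3000-byte write needs three segments and stalls, while a 2920-byte write needs two and does not; if every write stalls, the rate is capped at 1 / 0.2 = 5 writes per second.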


Delayed ACK affecting throughput: SNAP reports
showed that an application was writing small amounts of
data to the socket (WriteBytes) and its connections
experienced both delayed ACK and sender application limited
issues. For example, during 30 minutes, the application
wrote 10K records at only five records per second and
with a record size of 20-100 bytes.

The developers explained that theirs is a logging
application where the client uploads records to a server,
and that it should be able to generate far more than five
records per second. Looking into the code with the
developers, we found three key problems in the design:
(i) Blocking write: to simplify the programming, the client
does blocking writes and the server does blocking reads.
(ii) Small receive buffer: The server calls recv() in a
loop with a 200-byte buffer in hopes that exactly one
record is read in each receive call. (iii) Send buffer is
set to zero: Since the application is delay-sensitive, the
developer set the send buffer size to zero. The application
records are 20-100 bytes, much less than the MSS of 1460
bytes. Additionally, Nagle’s algorithm forces the socket to
wait for an ACK before it can send another packet
(record).¹¹ As a result, the single packet containing each
record always experiences delayed ACK, leading to a
throughput of only five records per second. To address this
problem while still avoiding the buffer copying in memory,
developers changed the sender code to write a group of
requests each time. Throughput improved to 10K requests/sec
after the change, a factor of 2000 improvement over five
records per second.
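The fix of writing a group of records per socket write might look like the following sketch. It is not the developers' code: the function name and the 64 KB batch limit are hypothetical, and it assumes records are self-delimiting so the server can split a batch back into records.

```python
def group_records(records, max_batch_bytes=64 * 1024):
    """Yield byte strings that each hold several whole records.

    Each yielded batch is meant to be passed to a single blocking socket
    write, so one delayed-ACK wait is shared by many records.
    """
    batch, size = [], 0
    for rec in records:
        # Close the current batch before it would exceed the limit.
        if batch and size + len(rec) > max_batch_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(rec)
        size += len(rec)
    if batch:
        yield b"".join(batch)
```

A sender would then call `sock.sendall(batch)` once per yielded batch instead of once per record.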


Delayed ACK affecting performance for pipelined ap- 
plications: By correlating connections to the same ma- 
chine, SNAP found two connections with performance 
problems that co-occur repeatedly: SNAP classified one 


11A similar performance problem caused by interactions between
delayed ACK and Nagle is discussed in [10].
connection as having a significant delayed ACK problem
and the other as having sender application problems. 

Developers told us that these two connections belong 
to the same application and form a pipeline pattern (Fig- 
ure 7). There is a proxy that sits between the clients 
and servers and serves as a load balancer. The proxy 
passes requests from the client to the server, fetches a se- 
quence of responses from the server, and passes them to 
the client. SNAP finds such a strong correlation between 
the delayed ACK problem and the receiver limited prob- 
lem because both stem from the passing of the messages 
through the proxy. 

After looking at the code, developers figured out that
the proxy uses a single thread and a single buffer for both
the client and the server. The proxy waits for the ACK
of every transfer (one packet in each transfer most of
the time) before fetching new response data from the
server.¹² When the developers changed the proxy to use
two different threads, one fetching responses from the
server and another sending responses to the client, with
a response queue between the two threads, the 99% tail
of the request processing time dropped from 200 ms to 10
ms.
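The two-thread design with a response queue can be sketched as below. This is an illustration of the structure only, not the proxy's code: `fetch_responses` and `send_to_client` are hypothetical callables standing in for the server-side read and the client-side write, and error handling is minimal (if the sender fails, the fetcher may block on a full queue).

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the response stream


def run_proxy(fetch_responses, send_to_client, max_queued=64):
    """Relay responses using one fetching thread and one sending thread."""
    responses = queue.Queue(maxsize=max_queued)

    def fetcher():
        try:
            for resp in fetch_responses():
                responses.put(resp)  # keeps fetching while the sender waits
        finally:
            responses.put(_DONE)

    def sender():
        while True:
            resp = responses.get()
            if resp is _DONE:
                break
            send_to_client(resp)  # may block on the client's ACK

    threads = [threading.Thread(target=fetcher),
               threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the fetcher no longer waits for the client's ACK, a delayed ACK on the client side stalls only the sending thread while new responses keep accumulating in the queue.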


6.3 Other Problems 


SNAP has also detected other problems such as switch
port failure (significant correlated packet losses across
multiple connections sharing the same switch port),
receiver window negotiation problems as reported in [14]
(connections are always receiver window limited while
receiver window size stays small), receiver not reading
the data fast enough (receiver window limited), and poor
latency caused by the Nagle algorithm (sender application
limited with small WriteBytes).


7 Reducing SNAP CPU Overhead 


To run in real time on all the hosts in the data center,
SNAP must keep the CPU overhead and data volume
low. The volume of data is small because (i) SNAP
logs socket calls and TCP statistics instead of other
high-overhead data such as packet traces and (ii) SNAP only
logs the TCP statistics when there is a performance
problem. To reduce CPU overhead, SNAP allows the operators
to set the target percentage of CPU usage on each
host. SNAP stays within a given CPU budget by dynamically
tuning the polling rate for different connections.


12The proxy uses the HTTP.sys library without setting the
HTTP_SEND_RESPONSE_FLAG_BUFFER_DATA flag [18], so HTTP.sys
waits for the ACK from the client before sending a “send complete”
signal to the application. By waiting for the ACK, HTTP.sys can make
sure the application send buffer is not overwritten until the data is
successfully transferred.
Figure 8: The CPU overhead of polling TCP statistics (poll)
and reading TCP table (rt) with different number of connec- 
tions (10, 100, 1K, 5K) and different intervals (from 50 ms to 
10 sec). 
Figure 9: Number of connections per machine.


CPU Overhead of Profiling: Since SNAP collects logs
for all the connections at the host, the overhead of SNAP 
consists of three parts: logging socket calls, reading the 
TCP table, and polling TCP statistics. 


Logging socket calls: In our data center, the cost of turn- 
ing on the event tracing for socket logging is a median of 
1.6% of CPU capacity [19]. 


Polling TCP statistics and reading TCP table: The CPU
overhead of polling TCP statistics and reading the TCP 
table depends on the polling frequency and the number 
of connections on the machine. Figure 8 plots the CPU 
overhead on a 2.5 GHz Intel Xeon machine. If we poll 
TCP statistics for 1K connections at 500 millisecond in- 
terval, the CPU overhead is less than 5%. The CPU over- 
head of reading the TCP table is similar. 

The CPU overhead is closely related to the number 
of connections on each machine. Figure 9 takes a snap- 
shot of the distribution of the number of established con- 
nections per machine. There are at most 10K estab- 
lished sockets and a median of 150. This means oper- 
ators can configure the interval of reading TCP table in 
most machines to be 500 millisecond or one second to 
keep the CPU overhead lower than 5%.¹³ Since most of


13We read TCP tables at 500 millisecond interval in our data collec- 


the connections in our data center are long-lived connec- 
tions (e.g., persistent HTTP connections), we can read 
the TCP table at a lower frequency compared to TCP 
statistics polling. For the machines with many connec- 
tions, we need to carefully adjust the polling rate of TCP 
statistics for each connection to achieve a tradeoff be- 
tween diagnosis accuracy and the CPU overhead. 


Dynamic Polling Rate Tuning: To achieve the best
tradeoff between CPU overhead and accuracy, operators
can first configure $l_{CPU}$ ($u_{CPU}$) to be the lower
(upper) bound of the CPU percentage used by SNAP. We then
propose an algorithm to dynamically tune the polling rate
for different connections to keep CPU overhead between
the two bounds. The basic idea of the algorithm is to
use a high polling rate for connections that are having
performance issues and a low polling rate for the
others.

We start by polling all the connections on one host at 
the same rate. If the current CPU overhead is below 
lcpu, we pick a connection that has the most perfor- 
mance problems in the past Tpistory time, and increase 
its polling rate for more detailed data. Similarly if the 
current CPU overhead is above ucpy, we pick a con- 
nection that has the least performance problems in the 
past Thistory time, and decrease its polling rate for more 
detailed data. Note that a lower polling rate introduces 
lower diagnosis accuracy. We can still catch those perfor- 
mance problems with the cumulative counters, but may 
miss some problems that rely on snapshots to detect. 
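
One step of this tuning loop can be sketched as follows. This is a minimal illustration under stated assumptions, not SNAP's implementation: the function name `tune_polling`, the `problem_counts` dictionary, the doubling and halving step, and the `min_rate`/`max_rate` limits are all hypothetical. The text only specifies which connection to adjust and in which direction.

```python
def tune_polling(rates, problem_counts, cpu_overhead, l_cpu, u_cpu,
                 min_rate=0.5, max_rate=20.0):
    """Adjust one connection's polling rate (polls per second) in place.

    rates:          dict conn_id -> current polling rate
    problem_counts: dict conn_id -> problems seen in the last T_history
    cpu_overhead:   current CPU fraction used by the profiler
    Returns the id of the connection whose rate changed, or None.
    """
    if cpu_overhead < l_cpu:
        # Spare CPU: poll the most problematic connection more often.
        candidates = [c for c in rates if rates[c] < max_rate]
        if not candidates:
            return None
        conn = max(candidates, key=lambda c: problem_counts.get(c, 0))
        rates[conn] = min(max_rate, rates[conn] * 2)  # step size is an assumption
        return conn
    if cpu_overhead > u_cpu:
        # Over budget: poll the least problematic connection less often.
        candidates = [c for c in rates if rates[c] > min_rate]
        if not candidates:
            return None
        conn = min(candidates, key=lambda c: problem_counts.get(c, 0))
        rates[conn] = max(min_rate, rates[conn] / 2)
        return conn
    return None  # overhead already lies between the two bounds
```

Calling the function once per measurement period moves the overhead back toward the band between the two bounds one connection at a time.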


8 Related Work


Previous work in diagnosing performance problems fo- 
cuses on either the application layer or the network 
layer. SNAP addresses the interactions between them 
that cause particularly insidious performance issues. 

In the application layer, prior work has taken several 
approaches: instrumenting application code [20, 21, 22] 
to find the causal path of problems, inferring the abnor- 
mal behaviors from history logs [11, 12], or identifying 
fingerprints of performance problems [23]. In contrast, 
SNAP focuses on profiling the interactions between ap- 
plications and the network and diagnosing network per- 
formance problems, especially ones that arise from those 
interactions. 

In the network layer, operators use network monitoring
tools (e.g., switch counters) and active probing
tools (ping, traceroute) to pinpoint network problems
such as switch failures or congestion. To diagnose
network performance problems, capture and analysis of
packet traces remains the gold standard. T-RAT [24]
uses packet traces to diagnose throughput bottlenecks in
Internet traffic. Tcpanaly [4] uses packet traces to
diagnose TCP stack problems. Others [25, 26] also infer
TCP performance and its problems from packet traces.
In contrast, SNAP focuses on multi-tier applications
in data centers where it has access to the network stack,
enabling us to create simple algorithms based on counters
that are far cheaper to collect than packet traces to
expose the network performance problems of the applications.


9 Conclusion 


SNAP combines socket-call logs of the application’s de- 
sired data-transfer behaviors with TCP statistics from the 
network stack that highlight the delivery of data. SNAP 
leverages the knowledge of topology, routing, and ap- 
plication deployment in the data center to correlate per- 
formance problems among connections, to pinpoint the 
congested resource or problematic software component. 

Our experiences in the design, development, and de-
ployment of SNAP demonstrate that it is practical to
build a lightweight, generic profiling tool that runs con- 
tinuously in the entire data center. Such a profiling tool 
can help both operators and developers in diagnosing 
network performance problems. 

With applications in data centers getting more com- 
plex and more distributed, the challenges of diagnosing 
the performance problems between the applications and 
the network will only grow in importance in the years 
ahead. For future work, we hope to further automate 
the diagnosis process to save developers’ efforts by ex- 
ploring the appropriate variables to monitor in the stack, 
studying the dependencies between the variables SNAP 
collects, and combining SNAP reports with automatic 
analysis of application software.

Efficiently Measuring Bandwidth at All Time Scales

Abstract

The need to identify correlated traffic bursts at various,
and especially fine-grain, time scales has become pressing
in modern data centers. The combination of Gigabit link
speeds and small switch buffers has led to “microbursts”,
which cause packet drops and large increases in latency.
Our paper describes the design and implementation of an
efficient and flexible end-host bandwidth measurement tool
that can identify such bursts in addition to providing a
number of other features. Managers can query the tool for
bandwidth measurements at resolutions chosen after the
traffic was measured. The algorithmic challenge is to
support such a posteriori queries without retaining the
entire trace or keeping state for all time scales. We
introduce two aggregation algorithms, Dynamic Bucket Merge
(DBM) and Exponential Bucketing (EXPB). We show
experimentally that DBM and EXPB implementations in the
Linux kernel introduce minimal overhead on applications
running at 10 Gbps, consume orders of magnitude less memory
than event logging (hundreds of bytes per second versus
Megabytes per second), but still provide good accuracy for
bandwidth measures at any time scale. Our techniques can be
implemented in routers and generalized to detect spikes
in the usage of any resource at fine time scales.


1 Introduction


How can a manager of a computing resource detect
bursts in resource usage that cause performance
degradation without keeping a complete log? The problem
is one of extracting a needle from a haystack; the problem
gets worse as the needle gets smaller (as finer-grain
bursts cause drops in performance) and the haystack gets
bigger (as the consumption rate increases). While our
paper addresses this general problem, we focus on
detecting bursts of bandwidth usage, a problem that has
received much attention [6, 16, 18] in modern data centers.

The simplest definition of a microburst is the
transmission of more than B bytes of data in a time
interval t on a single link, where t is on the order of
hundreds of microseconds. For input and output links of the
same speed, bursts must occur on several links at the same
time to overrun a switch buffer, as in the Incast problem
[8, 16]. Thus, a more useful definition is the sending of
more than B bytes in time t over several input links that
are destined to the same output switch port. This general
definition requires detecting bursts that are correlated in
time across several input links.

Microbursts cause problems because data center link
speeds have moved to 10 Gbps while commodity switch
buffers use comparatively small amounts of memory
(Mbytes). Since high-speed buffer memory contributes
significantly to switch cost, commodity switches continue
to provision shallow buffers, which are vulnerable
to overflowing and dropping packets. Dropped packets
lead to TCP retransmissions which can cause millisecond
latency increases that are unacceptable in data centers.


Administrators of financial trading data centers, for
instance, are concerned with the microburst phenomenon [4]
because even a latency advantage of 1 millisecond over
the competition may translate to profit differentials of
$100 million per year [14]. While financial networks
are a niche application, high-performance computing is
not. Expensive, special-purpose switching equipment
used in high-performance computing (e.g., InfiniBand
and Fibre Channel) is being replaced by commodity
Ethernet switches. In order for Ethernet networks to
compete, managers need to identify and address the
fine-grained variations in latencies and losses caused by
microbursts. At the core of this problem is the need to
identify the bandwidth patterns and corresponding
applications causing these latency spikes so that
corrective action can be taken.


Efficient and effective monitoring becomes increasingly
difficult as faster links allow very short-lived
phenomena to overwhelm buffers. For a commodity 24-port
10 Gbps switch with 4 MB of shared buffer, the
buffer can be filled (assuming no draining) in 3.2 msec by
a single link. However, given that bursts are often
correlated across several links and buffers must be shared,
the time scales at which interesting bursts occur can be
ten times smaller, down to hundreds of µs. Instead of
3.2 msec, the buffer can overflow in 320 µs if 10 input
ports each receive 0.4 MB in parallel. Assume that the
strategy to identify correlated bursts across links is to
first identify bursts on single links and then to observe
that they are correlated in time. The single-link problem
is then to efficiently identify periods of length t where
more than B bytes of data occur. Currently, t can vary from
hundreds of microseconds to milliseconds and B can vary
from hundreds of Kbytes to a few Mbytes. Solving this
problem efficiently using minimal CPU processing and
logging bandwidth is one of the main concerns of this paper.
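
The fill-time figures above follow from dividing the buffer size by the aggregate arrival rate. The short sketch below reproduces that arithmetic; the function name `fill_time_seconds` is hypothetical, and it assumes 1 MB = 10^6 bytes and no draining, the same simplification the text makes.

```python
def fill_time_seconds(buffer_bytes, link_gbps, num_links=1):
    """Time to fill a shared buffer when num_links inputs burst at line
    rate and nothing drains."""
    bits = buffer_bytes * 8
    rate_bps = link_gbps * 1e9 * num_links
    return bits / rate_bps

single = fill_time_seconds(4e6, 10)       # one 10 Gbps input, 4 MB buffer
ten = fill_time_seconds(4e6, 10, 10)      # ten inputs, 0.4 MB from each
print(f"{single * 1e3:.1f} ms, {ten * 1e6:.0f} us")   # 3.2 ms, 320 us
```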


Although identifying “bursts” on a single link for a
range of possible time scales and byte thresholds is
challenging, the ideal solution should do two more things.
First, the solution should efficiently extract flows
responsible for such bursts so that a manager can
reschedule or rate limit them. Second, the tool should
allow a manager to detect bursts correlated in time across
links. While the first problem can be solved using
heavy-hitter techniques [15], we briefly describe some new
ideas for this problem in our context. The second problem
can be solved by archiving bandwidth measurement records
indexed by link and time to a relational database which
can then be queried for persistent patterns. This requires
an efficient summarization technique so that the archival
storage required by the database is manageable.

NSDI '11: 8th USENIX Symposium on Networked Systems Design and Implementation

Generalizing to Bandwidth Queries: Beyond identify- 
ing microbursts, we believe that modeling traffic at fine 
time scales is of fundamental importance. Such model- 
ing could form the basis for provisioning NIC and switch 
buffers, and for load balancing and traffic engineering at 
fine time scales. While powerful, coarse-grain tools are 
available, the ability to flexibly and efficiently measure 
traffic at different, and especially fine-grain, resolutions 
is limited or non-existent. 

For instance, we are unable to answer basic ques- 
tions such as: what is the distribution of traffic bursts? 
At which time-scale did the traffic exhibit burstiness? 
With the identification of long-range dependence (LRD) 
in network traffic [9], the research community has un- 
dergone a mental shift from Poisson and memory-less 
processes to LRD and bursty processes. Despite its 
widespread use, however, LRD analysis is hindered by 
our inability to estimate its parameters unambiguously. 
Thus, our larger goal is to use fine-grain measurement 
techniques for fine-grain traffic modeling. 

While it is not difficult to choose a small number of 
preset resolutions and perform measurements for those, 
the more difficult and useful problem is to support traffic 
measurements for all time scales. Not only do measure- 
ment resolutions of interest vary with time (as in burst 
detection), but in many applications they only become 
critical after the fact, that is, after the measurements have 
already been performed. Our paper describes an end-host 
bandwidth measurement tool that succinctly summarizes 
bandwidth information and yet answers general queries 
at arbitrary resolutions without maintaining state for all 
time scales. 

Some representative queries (among many) that we 
wish such a tool to support are the following: 


1. What is the maximum bandwidth used at time scale 
t? 


2. What are the standard deviation and 95th percentile
of the bandwidth at time scale t?


3. What is the coarsest time scale at which bandwidth
exceeds threshold L?

In these queries, the query parameters t or L are chosen
a posteriori (after all the measurements have been
performed), and thus require supporting all possible
resolutions and bandwidths.


Existing techniques: All the above queries can
be easily answered by keeping the entire packet trace.
However, our data structures take an order of magnitude
less storage than a packet trace (even a sampled
packet trace) and yet can answer flexible queries with
good accuracy. Note that standard summarization
techniques, including simple ones like SNMP packet
counters [1] and more complex ones (e.g., heavy-hitter
determination [13]), are very efficient in storage but must
be targeted towards a particular purpose and at a fixed
time scale. Hence, they cannot answer flexible queries
for arbitrary time scales.


Note that sampling 1 in N packets, as in Cisco Net-
Flow [2], does not provide a good solution for bandwidth
measurement queries. Consider a 10 Gbps link with an 
average packet size of 1000 bytes. This link can carry
up to 1.25 million packets per second (10 Gbps divided by
8000 bits per packet). Suppose the scheme does
1 in 1000 packet sampling. It can still produce 1,250
samples per second with say 6 bytes per sample for time- 
stamp and packet size. To identify bursts of 1000 pack- 
ets of 1500 bytes each (1.5 MB), any algorithm would 
look for intervals containing 1 packet and scale up by the
down-sampling factor of 1000. The major problem is that
this causes false positives. If the trace is well-behaved 
and has no bursts in any specified period (say 10 msec), 
the scaling scheme will still falsely identify 1 in 1000 
packets as being part of bursts because of the large scal- 
ing factor needed for data reduction. Packet sampling, 
fundamentally, takes no account of the passage of time. 


In an information-theoretic sense, packet traces
are inefficient representations for bandwidth queries.
Viewing a trace as a time series of point masses (bytes in
each packet), it is more memory-efficient to represent the
trace as a series of time intervals with bytes sent per
interval. But this introduces the new problem of choosing
the intervals for representation so that bandwidth queries
on any interval (chosen after the trace has been
summarized) can be answered with minimal error.


Our first scheme builds on the simple idea that for any
fixed sampling interval, say 100 microseconds, one can
easily compute traffic statistics such as the maximum or
standard deviation with a few counters each. By
exponentially increasing the sampling interval, we can span
an aggregation period of length T, and still compute
statistics at all time scales from microseconds to
milliseconds, using only O(log T) counters. We call this
approach Exponential Bucketing (EXPB). The challenge in
EXPB is to avoid updating all log T counters on each packet
arrival and to prove error bounds.
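
The idea of exponentially spaced time scales can be illustrated with a deliberately naive sketch. It is not the EXPB algorithm itself: it updates every level on every packet, which is exactly the cost the paragraph above says EXPB must avoid, and it tracks only the maximum. The class name `NaiveExpBuckets`, the use of integer microsecond timestamps, and the left-closed intervals are assumptions made for the example.

```python
import math

class NaiveExpBuckets:
    """Track max bytes per interval at time scales base * 2**j."""

    def __init__(self, base, period):
        # O(log(period / base)) levels cover every scale up to `period`.
        self.levels = int(math.ceil(math.log2(period / base))) + 1
        self.base = base
        self.index = [0] * self.levels      # current interval number per level
        self.count = [0] * self.levels      # bytes in the current interval
        self.max_bytes = [0] * self.levels  # largest interval seen per level

    def add(self, t, size):
        # Naive: touches all levels for every packet.
        for j in range(self.levels):
            idx = int(t // (self.base * 2 ** j))
            if idx != self.index[j]:        # crossed into a new interval
                self.index[j] = idx
                self.count[j] = 0
            self.count[j] += size
            self.max_bytes[j] = max(self.max_bytes[j], self.count[j])
```

With a base of 100 µs and a period of 1600 µs the structure keeps five levels (100, 200, 400, 800, and 1600 µs), so memory grows with the logarithm of the period rather than with the number of packets.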


Our second idea, dubbed Dynamic Bucket Merge
(DBM), constructs an approximate streaming histogram
of the traffic so that bursts stand out as peaks in this
histogram. Specifically, we adaptively partition the
traffic into k intervals/buckets, in such a way that the
periods of heavy traffic map to more refined buckets than
those of low traffic. The time scales of these buckets
provide a “visual history” of the burstiness of the
traffic: the narrower the bucket in time, the burstier the
traffic. In particular, DBM is well-suited for identifying
not only whether a burst occurred, but how many bursts,
and when.

Figure 1: Example Deployment. End hosts and network
devices implementing EXPB and DBM push output data
over the network to a log server. Data at the server can
be monitored and visualized by administrators, then
collapsed and archived to long-term, persistent storage.

System Deployment: Exponential Bucketing and Dy- 
namic Bucket Merge have low computational and stor- 
age overheads, and can be implemented at multi-gigabit 
speeds in software or hardware. As shown in Figure 1, 
we envision a deployment scenario where both end hosts 
and network devices record fine-grain bandwidth sum- 
maries to a centralized log server. We argue that even 
archiving to a single commodity hard disk, administra- 
tors could pinpoint, to the second, the time at which cor- 
related bursts occurred on given links, even up to a year 
after the fact. 

This data can be indexed using a relational database, 
allowing administrators to query bandwidth statistics 
across links and time. For example, administrators could 
issue queries to “Find all bursts that occurred between 
10 and 11 AM on all links in Set S”. Set S could be the 
set of input links to a single switch (which can reveal In- 
cast problems) or the path between two machines. Band- 
width for particular links can then be visualized to further 
delineate burst behavior. The foundation for answering 
such queries is the ability to efficiently and succinctly 
summarize the bandwidth usage of a trace in real-time, 
the topic of this paper. 

We break down the remainder of our work as follows.
We begin with a discussion of related algorithms
and systems in Section 2. Section 3 illustrates the
Dynamic Bucket Merge and Exponential Bucketing algorithms,
both formally and with examples. We follow
with our evaluations in Section 4, describe the
implications for a system like Figure 1 in Section 5, and
conclude in Section 6.


2 Related Work 


Tcpdump [5] is a mature tool that captures a full log of 
packets at the endhost, which can be used for a wide va- 
riety of statistics, including bandwidth at any time scale. 
While flexible, tcpdump consumes too much memory for 
continuous monitoring at high speeds across every link 
and for periods of days. Netflow [2] can capture packet 
headers in routers but has the same issues. While sam- 
pled Netflow reduces storage, configurations with sub- 
stantial memory savings cannot detect bursts without re- 
sulting in serious false positives. SNMP counters [1], on 
the other hand, provide packet and byte counts but can 
only return values at coarse and fixed time scales. 

There is a wide variety of summarization data
structures for traffic streams, many of which are surveyed
in [15]. None of these can directly be adapted to solve
the bandwidth problem at all time scales, though
solutions to quantile detection do solve some aspects of
the problem [15]. For example, classical heavy-hitters [13]
measures the heaviest traffic flows during an interval.
By contrast, we wish to measure “heavy-hitting
sub-intervals across time”, so to speak. However,
heavy-hitter solutions are complementary in that they
identify the flows that cause the problem. The LDA data
structure [12] addresses a related problem, that of
measuring average latency. LDA is useful for directly
measuring latency violations. Our algorithms are
complementary in that they help analyze the bandwidth
patterns that cause latency violations.

DBM is inspired by the adaptive space partitioning 
scheme of [11], but is greatly simplified, and also con- 
siderably more efficient, due to the time-series nature of 
packet arrivals. 


3 Algorithms 


Suppose we wish to perform bandwidth measurements
during a time window $[0, T]$, assuming, without loss of
generality, that the window begins at time zero. We
assume that during this period $N$ packets are sent, with
$p_i$ being the byte size of the $i$th packet and $t_i$
being the time at which this packet is logged by our
monitoring system, for $i = 1, 2, \ldots, N$. These packets
are received and processed by our system as a stream,
meaning that the $i$th packet arrives before the $j$th
packet, for any $i < j$.

The bandwidth is a rate, and so converting our observed
sequence of $N$ packets into a quantifiable bandwidth
usage requires a time scale. Since we wish to
measure bandwidth at different time scales, let us first
make precise what we mean by this. Given a time
scale (or granularity) $\Delta$, where $0 < \Delta < T$,
we divide the measurement window $[0, T]$ into
sub-intervals of length $\Delta$, and aggregate all those
packets that are sent within the same interval. In this
way, we arrive at a sequence
$S_\Delta = (s_1, s_2, \ldots, s_k)$, where $s_i$ is the
sum of the bytes sent during the sub-interval
$((i-1)\Delta, i\Delta]$, and $k = \lceil T/\Delta \rceil$
is the number of such intervals.¹

Therefore, every choice of $\Delta$ leads to a
corresponding sequence $S_\Delta$, which we interpret as
the bandwidth use at the temporal granularity $\Delta$.
All statistical measurements of bandwidth usage at time
scale $\Delta$ correspond to statistics over this sequence
$S_\Delta$. For instance, we can quantify the statistical
behavior of the bandwidth at time scale $\Delta$ by
measuring the mean, standard deviation, maximum,
median, quantiles, etc. of $S_\Delta$.
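
As a reference point, the sequence can be computed directly when the whole trace is available. The sketch below is the exact, trace-keeping baseline rather than either of the paper's schemes; the function name `bandwidth_sequence` and the `(timestamp, bytes)` tuple layout are assumptions.

```python
import math
import statistics

def bandwidth_sequence(packets, T, delta):
    """Return S_delta: bytes per sub-interval ((i-1)*delta, i*delta]."""
    k = math.ceil(T / delta)
    s = [0] * k
    for t, size in packets:          # (timestamp, bytes), with 0 < t <= T
        i = math.ceil(t / delta)     # 1-based index of the right-closed interval
        s[i - 1] += size
    return s

trace = [(1, 100), (2, 200), (3, 50), (7, 400)]
s = bandwidth_sequence(trace, T=8, delta=2)
stats = (max(s), statistics.mean(s), statistics.pstdev(s))
```

Re-running `bandwidth_sequence` with a different `delta` gives the statistics at another time scale, but only because the full trace was retained, which is the cost the two schemes aim to avoid.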

In the following, we describe two schemes that can
estimate these statistics for every a posteriori choice of
the time scale $\Delta$. That is, after the packet stream
has been processed by our algorithms, the users can query
for an arbitrary granularity $\Delta$ and receive provable
quality approximations of the statistics for the sequence
$S_\Delta$.

Our first scheme, DBM, is time scale agnostic, and
essentially maintains a streaming histogram of the values
$s_1, s_2, \ldots, s_k$, by adaptively partitioning the
period $[0, T]$. Our second scheme, EXPB, explicitly
computes statistics for a priori settings of $\Delta$, and
then uses them to approximate the statistics for the
queried value of $\Delta$.

Since the two schemes are quite orthogonal to each
other, it is also possible to use them both in conjunction.
We give worst-case error guarantees for both of
the schemes. Both schemes are able to compute the
mean with perfect accuracy and estimate the other
statistics, such as the maximum or standard deviation, with
a bounded error. The approximation error for the DBM
scheme is expressed as an additive error, while the EXPB
scheme offers a multiplicative relative error. In
particular, for the DBM scheme, the estimation of the
maximum or standard deviation is bounded by an error term
of the form $O(\varepsilon B)$, where $0 < \varepsilon < 1$
is a user-specified parameter dependent on the memory used
by the data structure, and $B = \sum_{i=1}^{N} p_i$ is the
total packet mass over the measurement window. In the
following, we describe and analyze the DBM scheme, followed
by a description and analysis of the EXPB scheme.


3.1 Dynamic Bucket Merge 


DBM maintains a partition of the measurement window
$[0, T]$ into what we call buckets. In particular, an
$m$-bucket partition $\{b_1, b_2, \ldots, b_m\}$ is
specified by a sequence of time instants $t(b_i)$, with
$0 < t(b_i) \le T$, with the interpretation that the bucket
$b_i$ spans the interval $(t(b_{i-1}), t(b_i)]$. That is,
$t(b_i)$ marks the time when the $i$th bucket ends, with
the convention that $t(b_0) = 0$ and $t(b_m) = T$. The
number of buckets $m$ is controlled by the memory available
to the algorithm and, as we will show, the approximation
quality of the algorithm improves linearly with $m$. In the
following, our description and analysis of the scheme is
expressed in terms of $m$. Each bucket maintains $O(1)$
information, typically the statistics we are interested in
maintaining, such as the total number of bytes sent during
the bucket. In particular, in the following, we use the
notation $p(b)$ to denote the total number of data bytes
sent during the interval spanned by a bucket $b$.

¹ To deal with the boundary problem properly, we assume
that each sub-interval includes its right boundary, but not
the left boundary. If we assume that no packet arrives at
time 0, we can form a proper non-overlapping partition this
way.

The algorithm processes the packet stream
$p_1, p_2, \ldots, p_N$ in arrival time order, always
maintaining a partition of $[0, T]$ into at most $m$
buckets. (In fact, after the first $m$ packets have been
processed, the number of buckets will be exactly $m$, and
the most recently processed packet lies in the last bucket,
namely, $b_m$.) The basic algorithm is quite
straightforward. When the next packet $p_i$ is processed,
we place it into a new bucket $b_{m+1}$, with time interval
$(t_{i-1}, T]$; recall that $t_{i-1}$ is the time stamp
associated with the preceding packet $p_{i-1}$. We also
note that the right boundary of the predecessor bucket
$b_m$ now becomes $t_{i-1}$ due to the addition of the
bucket $b_{m+1}$. Since we now have $m + 1$ buckets, we
merge two adjacent buckets to reduce the bucket count down
to $m$. Several different criteria can be used for deciding
which buckets to merge, and we consider some alternatives
later, but in our basic scheme we merge the buckets based
on their packet mass. That is, we merge two adjacent
buckets whose sum of the packet mass is the smallest over
all such adjacent pairs. A pseudo-code description of DBM
is presented in Algorithm 1.


Algorithm 1: DBM

foreach $p_i \in S$ do
    Allocate a new bucket $b_i$ and set $p(b_i) = p_i$
    if $i == m + 1$ then
        Merge the two adjacent buckets $b_w$, $b_{w+1}$ for
        which $p(b_w) + p(b_{w+1})$ is minimum;
    end
end

3.1.1 DBM Example 


To clarify the operation of DBM we give the following 
example, illustrated in Figure 2. 

Suppose that we run DBM with 4 buckets ($m = 4$),
each of which stores a count of the number of buckets
that have been merged into it, the sum of all bytes
belonging to it, and the max number of bytes of any bucket
merged into it. Now suppose that 4 packets have arrived
with masses 10, 20, 35, and 5, respectively. The state of
DBM at this point is shown at the top of Figure 2. Note
that Algorithm 1 requires that we merge the buckets with
the minimum combined sum. Hence, we maintain a min
heap which stores the sums of adjacent buckets.

Figure 2: Dynamic Bucket Merge with 4 buckets. Initially
each bucket contains a single packet and the min
heap holds the sums of adjacent bucket pairs. When a
new packet (value = 40) arrives, a 5th bucket is allocated
and a new entry added to the heap. In the merge step, the
smallest value (30) is popped from the heap and the two
associated buckets are merged. Last, we update the heap
values that depended on either of the merged buckets.

When a fifth packet with a mass of 40 arrives, DBM 
allocates a new bucket for it and updates the heap with 
the sum of the new bucket and its neighbor. 

In the final step, the minimum sum is pulled from the
heap and the buckets contributing to that sum are merged.
In this example, the buckets containing masses 10 and 20
are merged into a single bucket with a new mass of 30 and
a max bucket value of 20. Note that we also update the
values in the heap which included the mass of either of
the merged buckets.


3.1.2 DBM Analysis 


The key property of DBM is that it can estimate the total
number of bytes sent during any time interval. In
particular, let $[t, t']$ be an arbitrary interval, where
$0 \le t, t' \le T$, and let $p(t, t')$ be the total number
of bytes sent during it, meaning
$p(t, t') = \sum_{i=1}^{N} \{p_i \mid t \le t_i \le t'\}$.
Then we have the following result.


Lemma 1. The data structure DBM estimates $p(t, t')$
within an additive error $O(B/m)$, for any interval
$[t, t']$, where $m$ is the number of buckets used by DBM
and $B = \sum_{i=1}^{N} p_i$ is the total packet mass over
the measurement window $[0, T]$.


Proof. We first note that in DBM each bucket’s packet
mass is at most $2B/(m-1)$, unless the bucket contains a
single packet whose mass is strictly larger than
$2B/(m-1)$. In particular, we argue that whenever two
buckets need to be merged, there always exists an adjacent
pair with total packet mass less than $2B/(m-1)$. Suppose
not. Then, summing the sizes of all $(m-1)$ pairs of
adjacent buckets must produce a total mass strictly larger
than $2(m-1)B/(m-1) = 2B$, which is impossible
since in this sum each bucket is counted at most twice,
so the total mass must be at most $2B$.

With this fact established, the rest of the lemma
follows easily. In order to estimate $p(t, t')$, we simply
add up the buckets whose time spans intersect the interval
$[t, t']$. Any bucket whose interval lies entirely inside
$[t, t']$ is accurately counted, and so the only error of
estimation comes from the two buckets whose intervals only
partially intersect $[t, t']$; these are the buckets
containing the endpoints $t$ and $t'$. If these buckets
have mass less than $2B/(m-1)$ each, then the total error
in estimation is less than $4B/(m-1)$, which is $O(B/m)$.
If, on the other hand, either of the end buckets contains a
single packet with large mass, then that packet is
correctly included or excluded from the estimation,
depending on its time stamp, and so there is no estimation
error. This completes the proof. □
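
The estimation step used in this proof, adding up every bucket whose span intersects the query interval, can be sketched as follows. The function name `estimate_bytes` and the `(end_time, mass)` tuple layout are assumptions for illustration; the sketch also omits the special handling of single large packets mentioned at the end of the proof.

```python
def estimate_bytes(buckets, t, t_prime):
    """Estimate p(t, t') by summing every bucket that overlaps [t, t'].

    buckets: list of (end_time, mass) in time order; bucket i spans
    (end_time[i-1], end_time[i]], and the first bucket starts at 0.
    """
    total = 0
    start = 0
    for end, mass in buckets:
        # (start, end] intersects [t, t'] iff start < t' and end >= t.
        if start < t_prime and end >= t:
            total += mass
        start = end
    return total

buckets = [(2, 30), (3, 35), (4, 5), (5, 40)]
whole = estimate_bytes(buckets, 0, 5)        # exact: every bucket fully inside
partial = estimate_bytes(buckets, 2.5, 3.5)  # may overcount the two end buckets
```

Only the buckets that straddle an endpoint of the query can contribute error, which is why the bound depends on the largest bucket mass rather than on the length of the interval.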


Theorem 1. With DBM we can estimate the maximum
or the standard deviation of $S_\Delta$ within an additive
error $\varepsilon B$, using memory $O(1/\varepsilon)$.


Proof. The proof for the maximum follows easily from
the preceding lemma. We simply query DBM for time
windows of length $\Delta$, namely,
$(i\Delta, (i+1)\Delta]$, for
$i = 0, 1, \ldots, \lceil T/\Delta \rceil$, and output the
maximum packet mass estimated in any of those intervals. In
order to achieve the target error bound, we use
$m = \frac{4}{\varepsilon} + 1$ buckets.

We now analyze the approximation of the standard
deviation. Recall that the sequence under consideration is
$S_\Delta = (s_1, s_2, \ldots, s_k)$, for some time scale
$\Delta$, where $s_i$ is the sum of the bytes sent during
the sub-interval $((i-1)\Delta, i\Delta]$, and
$k = \lceil T/\Delta \rceil$ is the number of such
intervals. Let $Var(S_\Delta)$, $E(S_\Delta)$, and
$E(S_\Delta^2)$, respectively, denote the variance, mean,
and mean of the squares for $S_\Delta$. Then, by
definition, we have

$$Var(S_\Delta) = E(S_\Delta^2) - E(S_\Delta)^2 = \frac{\sum_{i=1}^{k} s_i^2}{k} - E(S_\Delta)^2$$

Since DBM estimates each $s_i$ within an additive error of
$\varepsilon B$, our estimated variance respects the
following bound:

$$\frac{\sum_{i=1}^{k} (s_i + \varepsilon B)^2}{k} - E(S_\Delta)^2$$

However, we can compute $E(S_\Delta)^2$ exactly, because it
is just the square of the mean. In order to derive a bound
on the error of the variance, we assume that $k \ge m$,
that is, the size of the sequence $S_\Delta$ is at least as
large as the number of buckets in DBM. (Naturally,
statistical measurements are meaningless when the sample
size becomes too small.) With this assumption, we have
$2/k \le 2/m$, and since $\varepsilon = 4/(m-1)$, we get
that $\frac{2 \sum_{i=1}^{k} s_i}{k} \le \varepsilon B$,
which, considering $k \ge 1$, yields the following upper
bound for the estimated variance:

$$\frac{\sum_{i=1}^{k} (s_i + \varepsilon B)^2}{k} - E(S_\Delta)^2 \le \frac{\sum_{i=1}^{k} s_i^2}{k} - E(S_\Delta)^2 + \frac{k+1}{k}\,\varepsilon^2 B^2 \le Var(S_\Delta) + 2\varepsilon^2 B^2$$

which implies the claim. □
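
The querying procedure in this proof can be sketched as below: estimate each window's mass from the buckets, take the largest estimate as the maximum, and combine the estimated mean of squares with the exactly known mean. The function name `window_stats` and the `(end_time, mass)` bucket layout are assumptions, and the windows are taken as right-closed to match the sub-interval convention.

```python
import math

def window_stats(buckets, T, delta):
    """Estimate the max and standard deviation of S_delta from buckets.

    buckets: list of (end_time, mass) in time order, first bucket from 0.
    """
    k = math.ceil(T / delta)
    est = []
    for i in range(k):
        lo, hi = i * delta, min((i + 1) * delta, T)
        start, total = 0, 0
        for end, mass in buckets:
            if start < hi and end > lo:      # bucket overlaps (lo, hi]
                total += mass
            start = end
        est.append(total)
    mean = sum(mass for _, mass in buckets) / k   # exact: total bytes / k
    var = sum(s * s for s in est) / k - mean * mean
    return max(est), math.sqrt(max(var, 0.0))

mx, sd = window_stats([(2, 30), (3, 35), (4, 5), (5, 40)], T=5, delta=2)
```

The mean is exact because the buckets preserve the total byte count; only the per-window masses, and hence the mean of squares, are approximate.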


Similarly, we can show the following result for
approximating quantiles of the sequence $S_\Delta$.


Theorem 2. With DBM we can estimate any quantile of
$S_\Delta$ within an additive error $\varepsilon B$, using
memory $O(1/\varepsilon)$.


Proof. Let $s_1, s_2, \ldots, s_k$ be the sequence of data
in the intervals $(i\Delta, (i+1)\Delta]$, for
$i = 1, 2, \ldots, k = \lceil T/\Delta \rceil$, sorted in
increasing order, and let
$\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_k$ be the sorted
estimated sequence for the same intervals. We now compute
the desired quantile, for instance the 95th percentile, in
this sequence. Supposing the index of the quantile is $q$,
we return $\hat{s}_q$. We argue that the error of this
approximation is $O(\varepsilon B)$. We do this by
estimating bounds on the $s_i$ values that are erroneously
(due to approximation) misclassified, meaning reported
below or equal to the quantile when they are actually
larger, or vice versa. If no $s_i$ have been misclassified
then $\hat{s}_q$ and $s_q$ correspond to the same sample,
and by Lemma 1 the estimated value satisfies
$\hat{s}_q - s_q \le \varepsilon B$, hence the claim
follows. On the other hand, if a misclassification
occurred, then the sample $s_q$ is reported at an index
different than $q$ in the estimated sequence. Assume
without loss of generality that the sample $s_q$ has been
reported as $\hat{s}_u$, where $u > q$. Then, by the
pigeonhole principle, there is at least a sample $s_h$
($h > q$) that is reported as $\hat{s}_d$, $d \le q$.
By Lemma 1, $\hat{s}_d - s_h \le \varepsilon B$. Since
$s_q$ and $s_h$ switched ranks in the estimated sequence
$\hat{s}$, by Lemma 1 it holds that
$s_h - s_q \le \varepsilon B$ and
$\hat{s}_u - s_q \le \varepsilon B$. By assumption
$u > q \ge d$, so it follows that
$\hat{s}_u \ge \hat{s}_q \ge \hat{s}_d$ in the sorted
sequence $\hat{s}$, which implies that
$s_h - \hat{s}_q \le \varepsilon B$. The chain of
inequalities implies that
$\hat{s}_q - s_q \le 3\varepsilon B$, which completes the
proof. □


Algorithm 1 can be implemented at the worst-case cost
of $O(\log m)$ per packet, with the heap operation being
the dominant step. The memory usage of DBM is $O(m)$
as each bucket maintains $O(1)$ information.


3.1.3 Extensions to DBM for better burst detection


Generic DBM is a useful oracle for estimating bandwidth
in any interval (chosen after the fact) with bounded additive
error. However, one can tune the merge rule of DBM
if the goal is to pick out the bursts only. Intuitively, if
we have an aggregation period with $k$ bursts for small $k$
(say 10) spread out in a large interval, then ideally we
would like to compress the large trace to $k$ high-density
intervals. Of course, we would also like to represent the
comparatively low-traffic adjacent intervals, so an
ideal algorithm would partition the trace into $2k + 1$
intervals where the bursts and idle periods are clearly and
even visually identified. We refer to the generic scheme
discussed earlier that uses merge-by-mass as DBM-mm,
and describe two new variants as follows.


• merge-by-variance (DBM-mv): merges the two adjacent
  buckets that have the minimum aggregated packet mass
  variance


• merge-by-range (DBM-mr): merges the two adjacent
  buckets that have the minimum aggregated packet mass
  range (defined as the difference between maximum and
  minimum packet masses within the bucket)


These merge variants can also be implemented in logarithmic
time, and require storing $O(1)$ additional information
for each bucket (in addition to $p(b_i)$).

One minor detail is that DBM-mv and DBM-mr
are sensitive to null packet mass in an interval while
DBM-mm is not. For this reason, we make the DBM-mr
and DBM-mv algorithms work on the sequence defined
by $S_\Delta$, where $\Delta$ is the minimum time scale at which
bandwidth measurements can be queried. Then DBM-mr
and DBM-mv represent $S_\Delta$ as a histogram on $m$ buckets,
where each bucket has a discrete value for the signal.
The goal of a good approximation is to minimize the
difference between its predicted values and the true values
under some error metric. We consider both the $L_2$ norm
and the $L_\infty$ norm for the approximation error.


$$E_2 = \Big(\sum_{i=1}^{n} |s_i - \hat{s}_i|^2\Big)^{1/2} \qquad (1)$$

where $\hat{s}_i$ is the approximation for value $s_i$.

$$E_\infty = \max_{i=1}^{n} |s_i - \hat{s}_i| \qquad (2)$$
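
As a concrete reading of the two error metrics, the following sketch computes both for a short signal and a two-bucket histogram approximation. The function names and the sample values are illustrative assumptions, not taken from the paper's evaluation.

```python
import math

def e2(true_vals, approx_vals):
    """L2 error: square root of the summed squared differences."""
    return math.sqrt(sum((s - a) ** 2 for s, a in zip(true_vals, approx_vals)))

def e_inf(true_vals, approx_vals):
    """L-infinity error: the largest absolute difference."""
    return max(abs(s - a) for s, a in zip(true_vals, approx_vals))

signal = [10, 20, 35, 5]
# Two buckets of two samples each, every sample replaced by its bucket mean.
histogram = [15, 15, 20, 20]

print(e_inf(signal, histogram))  # 15
print(e2(signal, histogram))     # sqrt(500), about 22.36
```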


We compare the performance of DBM-mr and
DBM-mv algorithms with the optimal offline algorithms,
that is, a bucketing scheme that would find the optimal
partition of $S_\Delta$ to minimize the $E_2$ or the $E_\infty$ metric.
Then, the analysis of [7, 10] can be adapted to yield the
following results that formally state our intuitive goal of
picking out $m$ bursts with $2m + 1$ pieces of memory.


Theorem 3. The $L_\infty$ approximation error of the
$m$-bucket DBM-mr is never worse than the corresponding
error of an optimal $m/2$-bucket partition.


Theorem 4. The $L_2$ approximation error of the
$m$-bucket DBM-mv is at most $\sqrt{2}$ times the corresponding
error of an optimal $m/4$-bucket partition.


3.2 Exponential Bucketing 


Our second scheme, which we call Exponential Bucketing
(EXPB), explicitly computes statistics for a priori
settings of $\Delta_1, \ldots, \Delta_m$, and then uses them to
approximate the statistics for the queried value for any
$\Delta$, for $\Delta_1 \le \Delta \le \Delta_m$. We assume that the
time scales grow in powers of two, meaning that
$\Delta_i = 2^{i-1}\Delta_1$. Therefore, we can assume that the
scheme processes data at the time scale $\Delta_1$, namely,
the sequence $S_{\Delta_1} = (s_1, s_2, \ldots, s_k)$.

Conceptually, EXPB maintains bandwidth statistics
for all $m$ time scales $\Delta_1, \ldots, \Delta_m$. A naive
implementation would require updating $O(m)$ counters per
(aggregated) packet $s_i$. However, by carefully orchestrating
the accumulator update when a new $s_i$ is available, it is
possible to avoid spending $m$ updates per measurement,
as shown in Algorithm 2.

The intuition is as follows. Suppose one is maintaining
statistics at 100 µs and 200 µs intervals. When a packet
arrives, we update the 100 µs counter but not the 200 µs
counter. Instead, the 200 µs counter is updated only
when the 100 µs counter is zeroed. In other words, only
the lowest granularity counter is updated on every packet,
and coarser granularity counters are only updated when
all the finer granularity counters are zeroed.


Algorithm 2: EXPB


1  sum = <0, ..., 0> (m times);
2  foreach s_i do
3      sum[0] = s_i;
4      j = 0;
6      repeat
8          updatestat(j, sum[j]);
9          if j < m then
10             sum[j+1] += sum[j];
11         end
12         sum[j] = 0;
13         j++;
14     until i mod 2^j ≠ 0 or j ≥ m;
15 end


3.2.1 EXPB Example 


To better understand the EXPB algorithm we now present 
the example illustrated in Figure 3. 

In this example, we maintain 3 buckets (m = 3) each 
of which stores statistics at time scales of 1, 2 and 4 


Time   Samples   Bucket 1            Bucket 2            Bucket 3
1      10        Count: 1, Sum: 10
2      20        Count: 2, Sum: 20   Count: 1, Sum: 30
3      35        Count: 3, Sum: 35
4      5         Count: 4, Sum: 5    Count: 2, Sum: 40   Count: 1, Sum: 70


Figure 3: Exponential Bucketing Example. Each of the
$m$ buckets collects statistics at $2^{i-1}$ times the finest time
scale. At the end of each time scale $\Delta_i$, buckets 1 to $i$
must be updated. Before storing the new sum in a bucket
$j$, we first add the old sum into bucket $j + 1$, if it exists.


time units. Each bucket stores the count of the intervals
elapsed, the sum of the bytes seen in the current interval,
and fields to compute max and standard deviation. We label
the time units along the top and the number of bytes
accumulated during each interval along the bottom.

In the first time interval 10 bytes are recorded in the
first bucket and 10 is pushed to the sum of the second
bucket. We repeat this operation when 20 is recorded
in the second interval. Since 2 time units have elapsed,
we also update the statistics for the $\Delta_2$ time scale, and
add bucket 2's sum to bucket 3. In the third interval
we update bucket 1 as before. At time 4 we update
bucket 2 with the current sum from bucket 1, update
bucket 2's statistics, and push bucket 2's sum
to bucket 3. Finally, we update the statistics for $\Delta_3$ with
bucket 3's sum.


3.2.2 EXPB Analysis 


Algorithm 2 uses $O(m)$ memory and runs in $O(k)$ worst-case
time, where $k = \lceil T/\Delta_1 \rceil$ is the number of
intervals at the lowest time scale of the algorithm. The
per-interval processing time is amortized constant, since the
repeat loop starting at Line 6 simply counts the number
of trailing zeros in the binary representation of $i$, for
all $0 < i \le k = T/\Delta_1$. The procedure updatestat()
called at Line 8 updates in constant time the $O(1)$
information necessary to maintain the statistics for each
$\Delta_j$, for $1 \le j \le m$.

We now describe and analyze the bandwidth estimation
using EXPB. Given any query time scale $\Delta$, we output
the maximum of the bandwidth corresponding to the
smallest index $j$ for which $\Delta_j \ge \Delta$, and use the sum
of squared packet masses stored for granularity $\Delta_j$ to
compute the standard deviation. The following lemma
bounds the error of such an approximation.


Lemma 2. With EXPB we can return an estimation of
the maximum or standard deviation of $S_\Delta$ that is
between factor 1/2 and 3 from the true value. The bound on
the standard deviation holds in the limit when the ratio
$E(S_\Delta^2)/E(S_\Delta)^2$ is large.


Proof. We first prove the result for the statistic maximum,
and then address the standard deviation. Let $I$
be the interval $((i - 1)\Delta, i\Delta]$ corresponding to the time
scale $\Delta$ in which the maximum value is achieved, and let
$p(I)$ be this value. Since $\Delta_j \ge \Delta$, there are at most two
consecutive intervals $I^j_a, I^j_{a+1}$ at time scale $\Delta_j$ that
together cover $I$. By the pigeonhole principle, either $I^j_a$ or
$I^j_{a+1}$ must contain at least half the mass of $I$, and
therefore the maximum value at time scale $\Delta_j$ is at least 1/2 of
the maximum value at $\Delta$. This proves the lower bound
side of the approximation. In order to obtain a corresponding
upper bound, we simply observe that if $I^j$ is
the interval at time scale $\Delta_j$ with the maximum value,
then $I^j$ overlaps with at most 3 intervals of time scale
$\Delta$. Thus, the maximum value at time scale $\Delta_j$ cannot be
more than 3 times the maximum at $\Delta$, proving an upper
bound on the approximation.

The analysis for the standard deviation follows along
the same lines, using the observation that
$\mathrm{stddev}_\Delta = \sqrt{E(S_\Delta^2) - E(S_\Delta)^2}$.
An argument similar to the one used for the maximum value
holds for the approximation of $E(S_\Delta^2)$. Then assuming
the ratio $E(S_\Delta^2)/E(S_\Delta)^2$ to be a constant
sufficiently greater than 1 implies the claim. We omit the
simple algebra from this extended abstract. □
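

The query rule analyzed in Lemma 2 (use the smallest precomputed scale that is at least the queried one, and derive the standard deviation from the stored sum and sum of squares) can be sketched as follows. The function names are hypothetical, and the list of scales assumes the power-of-two spacing described above with a 78.125 µs base.

```python
import math

def pick_scale(delta, scales):
    """Index of the smallest precomputed scale that is >= the queried scale."""
    for j, d in enumerate(scales):
        if d >= delta:
            return j
    raise ValueError("query time scale exceeds the coarsest scale")

def stddev(count, total, sum_sq):
    """Standard deviation from running sums: sqrt(E[S^2] - E[S]^2)."""
    mean = total / count
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sum_sq / count - mean * mean, 0.0))

scales_us = [78.125 * 2 ** j for j in range(11)]   # 78.125 us ... 80 ms
j = pick_scale(400.0, scales_us)
print(j, scales_us[j])                 # 3 625.0
print(stddev(2, 70, 30**2 + 40**2))    # 5.0 for the two intervals 30 and 40
```

A 400 µs query is therefore answered from the 625 µs scale, which is within a factor of two of the requested scale as the proof requires.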


We note that there is a non-trivial extension of EXPB
which allows it to work with a set of exponentially
increasing time granularities whose common ratio can be
any $\alpha > 1$. This can reduce average error. For a
general $\alpha > 1$, Algorithm 2 cannot be easily adapted, so
we need a generalization of it that uses an event queue
while processing measurements to schedule when in the
future a new measurement of length $\Delta_j$ must be sent to
updatestat(). The details are omitted for lack of space.


3.3 Culprit Identification


As mentioned earlier, we do not want to simply identify 
bursts but also to identify the flow (e.g., TCP connection, 
or source IP address, protocol) that caused the burst so 
that the network manager can reschedule or move the of- 
fending station or application. The naive approach would 
be to add a heavy-hitters [13] data structure to each DBM 
bucket, which seems expensive in storage. Instead, we 
modify DBM to include two extra variables per bucket: a 
flowID and a flow count for the flowID. 

The simple heuristic we suggest is as follows. Initially,
each packet is placed in a bucket, and the bucket’s flowID
is set to the flowID of its packet. When merging two
buckets, if the buckets have the same flowID, then that
flowID becomes the flowID of the merged bucket and
the flow counts are summed. If not, then one of the two
flowIDs is picked with probability proportional to their
flow counts. Intuitively, the higher count flows are more
likely to be picked as the main contributor in each bucket
as they are more likely to survive merges.
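
The merge step of this heuristic can be sketched as below. The function name and signature are hypothetical. The text does not say which count the merged bucket keeps when the flowIDs differ; this sketch assumes it keeps the winning flow's own count, which is only one possible choice.

```python
import random

def merge_flow(id_a, count_a, id_b, count_b, rng=random):
    """Return the (flowID, flow count) of the bucket formed by merging a and b."""
    if id_a == id_b:
        return id_a, count_a + count_b      # same flow: counts are summed
    # Different flows: pick one with probability proportional to its count.
    if rng.random() * (count_a + count_b) < count_a:
        return id_a, count_a                # assumption: keep the winner's count
    return id_b, count_b

print(merge_flow("flowA", 3, "flowA", 4))   # ('flowA', 7)

rng = random.Random(0)
wins = sum(merge_flow("big", 9, "small", 1, rng)[0] == "big" for _ in range(10_000))
print(wins / 10_000)                        # close to 0.9
```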

For EXPB, a simple idea is to use a standard heavy- 
hitters structure [13] corresponding to each of the loga- 
rithmic time scales. When each counter is reset, we up- 
date the flowID if the maximum value has changed and 
reinitialize the heavy-hitters structure for the next inter- 
val. This requires only a logarithmic number of heavy- 
hitters structures. Since there appears to be redundancy 
across the structures at each time scale, more compres- 
sion appears feasible but we leave this for future work. 


4 Evaluation 


We now evaluate the performance and accuracy of DBM 
and EXPB to show that they fulfill our goal of a tool 
that efficiently utilizes memory and processing resources 
to faithfully capture and display key bandwidth mea- 
sures. We will show that DBM and EXPB use significantly 
fewer resources than packet tracing and are suitable for 
network-wide measurement and visualization. 


4.1 Measurement Accuracy 


We implemented EXPB and the three variants of DBM as
user-space programs and evaluated them with real traffic
traces. Our traces consisted of packets captured from
the 1 Gigabit switch that connects several infrastructure
servers used by the Systems and Networking group at
U.C. San Diego, and socket-level send data produced by
the record-breaking TritonSort sorting cluster [17].

Our “rsync” trace captured individual packets from 
an 11-hour period during which our NFS server ran its 
monthly backup to a remote machine using rsync. This 
trace recorded the transfer of 76.2 GB of data in 60.6 
million packets, of which 66.6 GB was due to the backup 
operation. The average throughput was 15.4 Mbps with 
a maximum of 782 Mbps for a single second. 

The “tritonsort” trace contains time-stamped byte 
counts from successful send system calls on a single 
host during the sorting of 500 GB of data using 23 nodes 
connected by a 10 Gbps network. This trace contains an 
average of 92,488 send events per second, with a peak of 
123,322 events recorded in a single 1-second interval. In 
total, 20.8 GB were transferred over 34.24 seconds for 
an average throughput of 4.9 Gbps. 

Ideally, our evaluation would include traffic from a
mix of production applications running over a 10 Gbps
network. While we do not have access to such a deployment,
our traces provide insight into how DBM and EXPB
might perform given the high bandwidth and network
utilization of the “tritonsort” trace and the large variance in
bandwidth from second to second in the “rsync” trace.

For our accuracy evaluation, we used an aggregation
period of 2 seconds. To avoid problems with incomplete
sampling periods in EXPB, we must choose our
time scales such that they all evenly divide our
aggregation period. Since the prime factors of 2 seconds in
nsec are $2^{10}$ and $5^{9}$ nsec, EXPB can use up to 11 buckets.
Thus for EXPB, we choose the finest time scale to
be $\Delta$ = 78.125 µs ($5^7$ nsec) and the coarsest to be
$\Delta$ = 80 msec ($2^{10} 5^7$ nsec), which is consistent with the
time scales for interesting bursts in data centers. For
consistency, we also configure DBM to use a base sampling
interval of 78.125 µs, but note that it can answer queries
up to $\Delta$ = 2 seconds.
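
The arithmetic behind the choice of 11 time scales can be checked directly. The sketch below works in integer nanoseconds; the variable names are illustrative.

```python
PERIOD_NS = 2 * 10**9            # 2-second aggregation period
assert PERIOD_NS == 2**10 * 5**9

BASE_NS = 5**7                   # 78,125 ns = 78.125 us
scales_ns = [BASE_NS * 2**j for j in range(11)]

# Every chosen scale divides the period evenly ...
assert all(PERIOD_NS % s == 0 for s in scales_ns)
# ... but one more doubling (160 ms) would not.
assert PERIOD_NS % (BASE_NS * 2**11) != 0

print(scales_ns[0] / 1e3, scales_ns[-1] / 1e6)   # 78.125 80.0
```

Doubling from 78.125 µs uses up one factor of two per step, and with ten such steps available the period supports exactly 11 scales, the coarsest being 80 msec.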

To provide a baseline measurement, we computed
bandwidth statistics for all of our traces at various time
scales where $\Delta \ge$ 78.125 µs. To ensure that all
measurements in $S_\Delta$ cover intervals of equal length, we only
evaluated time scales that evenly divided 2 seconds. In total,
this provided us with ground-truth statistics at 52 different
time scales ranging from 78.125 µs to 2 seconds. In the
following sections we report accuracy in terms of error
relative to these ground-truth measurements. While any number
of values could be used for $\Delta$ and $T$ in practice, we used
these values across our experiments for the sake of a consistent
and representative evaluation between algorithms.
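
One way to enumerate such reference time scales is to list the divisors of the 2-second period (in nanoseconds) that are at least 78.125 µs. This is an assumption about how the set was built, not something the text states, but it does reproduce the count of 52. The helper names are hypothetical.

```python
import math

def divisors(n):
    """All positive divisors of n, in increasing order."""
    small = [d for d in range(1, math.isqrt(n) + 1) if n % d == 0]
    return sorted(set(small + [n // d for d in small]))

def reference_scales(period_ns=2 * 10**9, min_ns=78_125):
    """Time scales (ns) that evenly divide the period and are >= min_ns."""
    return [d for d in divisors(period_ns) if d >= min_ns]

scales = reference_scales()
print(len(scales), scales[0], scales[-1])   # 52 78125 2000000000
```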


4.1.1 Accuracy vs. Memory 


We begin by investigating the tradeoff between memory 
and accuracy. At one extreme, SNMP can calculate av- 
erage bandwidth using only a single counter. In contrast, 
packet tracing with tcpdump can calculate a wide range 
of statistics with perfect accuracy, but with storage cost 
scaling linearly with the number of packets. Both DBM 
and EXPB provide a tradeoff between these two extremes 
by supporting complex queries with bounded error, but 
with orders of magnitude less memory. 

For comparison, consider the simplest event trace
which captures a 64-bit timestamp and a 16-bit byte
length for each packet sent or received. Using this data,
one could calculate bandwidth statistics for the trace with
perfect accuracy at a memory cost of 6 bytes per event.
In contrast, DBM and EXPB require 8 and 16 bytes of storage
per bucket used, respectively, along with a few bytes
of metadata for each aggregation period.
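
These per-event and per-bucket costs translate into data rates as follows. The sketch uses the 6-byte event size and the bucket sizes quoted above, the event rates reported for the “tritonsort” trace, and the 2-second aggregation period; the function names are illustrative.

```python
def trace_rate(events_per_sec, bytes_per_event=6):
    """Bytes per second produced by a per-event trace."""
    return events_per_sec * bytes_per_event

def bucket_rate(buckets, bytes_per_bucket, period_sec):
    """Bytes per second produced by emitting all buckets once per period."""
    return buckets * bytes_per_bucket / period_sec

print(trace_rate(92_488))        # 554928   (about 555 KBps, average)
print(trace_rate(123_322))       # 739932   (about 740 KBps, peak)
print(bucket_rate(1000, 8, 2))   # 4000.0   (4 KBps for DBM)
print(bucket_rate(11, 16, 2))    # 88.0     (EXPB, before per-period metadata)
```

The bucket-based rates do not depend on the traffic volume, which is why they stay fixed while the trace rate scales with the number of events.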

To quantify these differences, we queried our traces 
for max, standard deviation, and 95th percentile (DBM 
only). For each statistic, we compute the average rela- 
tive error of the measurements at each of our reference 
time scales and report the worst-case. To avoid spurious 
errors due to low sample counts, we omit accuracy data 
for standard deviations with fewer than 10 samples per 
aggregation period and 95th percentiles with fewer than 
20 samples per aggregation period. We show the tradeoff 
between storage and accuracy in Table 1. 





                        Data output rate   Max of Avg. Rel. Error
                                           Max      S.Dev.   95th
packet trace (avg)      9.2 KBps
DBM-mr                  4 KBps             7.6%


Table 2: We repeated our evaluation with the “rsync”
trace and report accuracy results for our two best
performing algorithms: DBM-mr and EXPB. We calculated
the average relative error for each of our reference
time scales and show the worst case.

While the simple packet trace gives perfectly accurate
statistics, both DBM and EXPB consume memory at
a fixed rate which can be configured by specifying the
number of buckets and the aggregation period. In the
presented configuration, DBM and EXPB generate 4
KBps and 96 Bps, respectively, which is orders of magnitude
less memory than the simple trace.

The cost of reduced storage overhead in DBM and
EXPB is the error introduced in our measurements. However,
we see that the range of average relative error rates
is reasonable for max, standard deviation, and 95th
percentile measurements. Further, of the DBM algorithms,
DBM-mr gives the lowest errors throughout. While not
shown, DBM’s errors are largely due to under-estimation,
but its accuracy improves as the query interval grows.
EXPB gives consistent estimation errors for max across
all of our reference points, but gradually degrades for
standard deviation estimates as query intervals increase.
Thus, for this trace, EXPB achieves the lowest error for
query intervals less than 2 msec. We have divided Table 1
to show the worst-case errors in these regions.

In Table 2, we show the accuracy of DBM-mr and
EXPB when run on the “rsync” trace with the same
parameters as before. We note that again DBM-mr gives
the most accurate results for larger query intervals, but
now out-performs EXPB for query intervals greater than
160 µs for max and 1 msec for standard deviation.

To see the effect of scaling the number of buckets,
we picked a representative query interval of 400 µs and
investigated the accuracy of DBM-mr as the number of
buckets was varied. The results of measuring the max,
standard deviation and 95th percentile on the “tritonsort”
trace are shown in Figure 4. We see that the relative error
for all measurements decreases as the number of buckets
is increased. However, at 4,000 buckets the curves
flatten significantly and additional buckets beyond this
do not produce any significant improvement in accuracy.
While one might expect the error to drop to zero when
the number of buckets is equal to the number of samples
in $S_\Delta$ (5000 samples for 400 µs), we do not see this since
the trace is sampled at a finer granularity (78.125 µs) and


                                          Max of Avg. Relative Error
                                          < 2 msec                ≥ 2 msec
Algorithm              Data output rate   Max    S.Dev.   95th    Max    S.Dev.   95th
packet trace           555 KBps (avg)     0%     0%       0%      0%     0%       0%
                       740 KBps (peak)
DBM-mm, 1000 buckets
DBM-mv, 1000 buckets
DBM-mr, 1000 buckets
EXPB, 11 buckets       96 Bps             27%    25%      N/A     28%    81%      N/A


Table 1: Memory vs. Accuracy. We evaluate the “tritonsort” trace with a base time scale of $\Delta$ = 78.125 µs and a 2
second aggregation period. Data output rate is reported for a simple packet trace compared with the DBM and EXPB
algorithms. For each statistic, we compute the max of the average relative error of measurements for each of our
reference time scales.


Figure 4: Relative error for DBM-mr algorithm shown for the 400 µs time scale with a varying number of buckets. The
box plots show the range of relative errors from the 25th to 75th percentiles, with the median indicated in between.
The box whiskers indicate the min and max errors.


the buckets are merged online. There is no guarantee that
DBM will merge the buckets such that each spans exactly
400 µs of the trace.

With approximations of the max and standard deviation
at this degree of accuracy, we see both DBM and
EXPB as excellent, low-overhead alternatives to packet
tracing.


4.1.2 DBM Visualization 


One unique property of the DBM algorithms is that they
can be visualized to show users the shape of the bandwidth
curves. Note that we proved earlier that DBM-mr
is optimal in some sense in picking out bursts. We now
investigate experimentally how all DBM variants do in
burst detection.

In Figure 5 we show the output for a single, 2 second
aggregation period from the “rsync” trace using
DBM-mr. For visual clarity, we configured DBM-mr to
aggregate measurements at a 4 msec base time scale (250
data points) using 9 buckets. Figure 5 shows the raw
data points (bandwidth use in each 4 msec interval of
the 2 second trace) with the DBM-mr output superimposed.
Notice that DBM-mr picks out four bursts (the
vertical lines). The fourth burst looks smaller than the
3.1 Mbps burst observable in the raw trace. This is because
there were two adjacent measurement intervals in
the raw trace with bandwidths of 3.1 and 2.2 Mbps,
respectively. DBM-mr merged these measurements into a
single bucket with an average bandwidth of 2.65 Mbps
for 8 msec.

We show the output for all DBM algorithms in a cleaner
visual form in Figures 6a, 6b and 6c. We have
normalized the width of the buckets and list their start
and end times on the x-axis. Additionally, we label each
bucket with its mass (byte count). This representation
compresses periods of low traffic and highlights
short-lived, high-bandwidth events. From the visualization of
DBM-mr in Figure 6c, we can quickly see that there were
four periods of time, each lasting between 4 and 8 msec,
where the bandwidth exceeded 2.3 Mbps. Note that in
Figure 6a, DBM-mm picks out only two bursts. The
remaining bursts have been merged into the three buckets
spanning the period from 1440 to 1636 msec, thereby
reducing the bandwidth (the y-axis) because the total time
of the combined bucket increases.

Figure 5: Visualization of events from a 2 second aggregation
period overlaid with the output of DBM-mr using
9 buckets and a 4 msec measurement time scale.


In practice, a network administrator might want to
quickly scan such a visualization and look for microburst
events. To simulate such a scenario, we randomly inserted
three bursts, each lasting 4 msec and transmitting
between 4.0 and 4.4 MB of data. We show the DBM
visualization for this augmented trace in the bottom row of
Figure 6. DBM-mr and DBM-mm both allocate their memory
resources to capture all three of these important events,
even though they only represent 12 msec of a 2 second
aggregation period. Again, DBM-mr cleanly picks out
the three bursts.


4.1.3 Accuracy at High Load 


As mentioned in Lemma 1, the error associated with the
DBM algorithms increases with the ratio of total packet
mass (total bytes) to number of buckets within an
aggregation period. We now investigate to what extent
increasing the mass within an aggregation period affects
the measurement accuracy of DBM. To evaluate this, we
first configured DBM to use a base time scale of $\Delta$ =
78.125 µs and 1000 buckets, as before, but vary the
mass stored in DBM by changing the aggregation period.
Figures 7a & 7b show the change in average relative
error for both max and standard deviation statistics in
our high-bandwidth “tritonsort” trace at a representative
query time scale (400 µs) as the aggregation period is
varied between 1 and 16 seconds.

For DBM-mm and DBM-mv with 1000 buckets the relative
error diverges significantly as the aggregation period
is increased. In contrast, DBM-mr shows only a subtle
degradation for max from 5.9% to 12.3%. For standard
deviation, DBM-mv shows consistently poor performance
with average relative errors increasing from 32% to 64%,
while both DBM-mm and DBM-mr trend together, with
DBM-mr’s errors ranging from 9.8% to 31.7%.

We contrast DBM-mr’s performance for these experiments
with that of EXPB. We see that EXPB’s average
relative error in the max measurement gradually falls
from 2.8% to 1.9% as the aggregation period increases.
Further, the error in standard deviation falls from 1.4%
at a 1 second aggregation period to 0.5% at 16 seconds.

These results indicate that degradation in accuracy
does occur as the ratio of the total packet mass to bucket
count increases, as predicted by Lemma 1. While DBM
must be configured correctly to bound the ratio of packet
mass to bucket count, EXPB’s accuracy is largely unaffected
by the packet mass or aggregation period.


4.2 Performance Overhead 


As previously stated, we seek to provide an efficient
alternative to packet capture tools. Hence we compare
the performance overhead of DBM and EXPB to that of
an unmodified vanilla kernel, and to the well-established
tcpdump [5].

We implemented our algorithms in the Linux 2.6.34
kernel along with a userspace program to read the captured
statistics and write them to disk. To provide greater
computational efficiency we constrained the base time
scale and the aggregation period to be powers of 2. The
following experiments were run on 2.27 GHz, quad-core
Intel Xeon servers with 24 GB of memory. Each server
is connected to a top-of-rack switch via 10 Gbps Ethernet
and has a round trip latency of approximately 100 µs.

To quantify the impact of our monitoring on performance,
we first ran iperf [3] to send TCP traffic between
two machines on our 10 Gbps network for 10 seconds.
In addition, we instrumented our code to report the time
spent in our routines during the test. We first ran the
vanilla kernel source, then added different versions of
our monitoring to aggregate 64 µs intervals over 1 second
periods. We report both the bandwidth achieved by
iperf and the average latency added to each packet at the
sending server in Table 3. For comparison, we also report
performance numbers for tcpdump when run with
the default settings and writing the TCP and IP headers
(52 bytes) of each packet directly to local disk. As
DBM-mr is nearly identical to DBM-mm with respect to
implementation, we omit DBM-mr’s results.

As discussed in section 3.1, we see that the latency 
overhead per packet increases roughly as the log of the 
number of buckets. However, iperf’s maximum through- 
put is not degraded by the latency added to each packet. 
Since the added latency per packet is several orders of 
magnitude less than the RTT, the overhead of DBM should 
not affect TCP’s ability to quickly grow its congestion 
window. In contrast to DBM, tcpdump achieves 3.5% 
less throughput. 

To observe the overhead of our monitoring on an application,
we transferred a 1 GB file using scp. We
measured the wall-clock time necessary to complete the
transfer by running scp within Linux’s time utility.
To quantify the effects of our measurement on the total
completion time, we measured the total overhead imposed
on packets as they moved up and down the network
stack. We report this overhead as a percentage of each
experiment’s average completion time (monitoring time
divided by scp completion time). Each experiment was
replicated 60 times and results are reported in Table 4.
We see that although the cumulative overhead added by
DBM grows logarithmically with the number of buckets,
the time for scp to complete increases by at most 4.5%.


Figure 6: Visualization of DBM with 9 buckets over a single 2 second aggregation period. Panels: (a) DBM-mm
visualization, (b) DBM-mv visualization, (c) DBM-mr visualization, (d) DBM-mm visualization of bursty traffic,
(e) DBM-mv visualization of bursty traffic, (f) DBM-mr visualization of bursty traffic. The start and end times for
each bucket are shown on the x-axis, and each bucket is labeled with its mass (byte count). The top figures show the
various DBM approximations of a single aggregation period, while the lower graphs show the same period with three
short-lived, high bandwidth bursts randomly inserted.

We see that our implementations of DBM and EXPB 
have a negligible impact on application performance, 
even while monitoring traffic at 10 Gbps. 


4.3 Evaluation Summary


Our experiments indicate DBM-mr consistently provides
better burst detection and has reasonable average case
and worst case error for various statistics. When measuring
at arbitrary time scales, EXPB has comparable or
better average and worst-case error than DBM while using
less memory. In addition, EXPB is unaffected by
high mass in a given aggregation period. On the other hand,
DBM can approximate time series, which is useful for seeing
how bursts are distributed in time and for calculating
more advanced statistics (e.g., percentiles). We recommend
a parallel implementation where EXPB is used for
max and standard deviation and DBM-mr is used for all
other queries.


5 System Implications


So far, we have described DBM and EXPB as part of an
end host monitoring tool that can aggregate and visualize
bandwidth data with good accuracy. However, we see
these algorithms as part of a larger infrastructure monitoring
system.


Long-term Archival and Database Support  It is useful
for administrators to retrospectively troubleshoot
problems that are reported by customers days after the 
fact. At slightly more than 4 KBps, the data produced by 
both DBM and EXPB for a week (2.4 GB per link) could 
easily be stored to a commodity disk. With this data, 
an administrator can pinpoint traffic abnormalities at mi- 
crosecond timescales and look for patterns across links. 
The data can be compacted for larger time scales by re- 
ducing granularity for older data. For example, one hour 
of EXPB data could be collapsed into one set of buckets 
containing max and standard deviation information at the 
original resolutions but aggregated across the hour. 
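
Both points in this paragraph can be illustrated briefly: the weekly volume at 4 KBps, and collapsing many per-period summaries for one time scale into a single summary. The `ScaleSummary` fields are hypothetical stand-ins for the per-bucket state (count, sum, and the fields used for max and standard deviation); the actual record layout is not given in the text.

```python
from dataclasses import dataclass

@dataclass
class ScaleSummary:
    count: int      # number of intervals summarized
    total: float    # sum of per-interval byte counts
    sum_sq: float   # sum of squared per-interval byte counts
    max_val: float  # largest per-interval byte count

def collapse(summaries):
    """Combine per-period summaries of ONE time scale into a single summary."""
    return ScaleSummary(
        count=sum(s.count for s in summaries),
        total=sum(s.total for s in summaries),
        sum_sq=sum(s.sum_sq for s in summaries),
        max_val=max(s.max_val for s in summaries),
    )

week_bytes = 4_000 * 7 * 24 * 3600
print(week_bytes)   # 2419200000, about 2.4 GB per link per week

hour = collapse([ScaleSummary(2, 70, 2500, 40), ScaleSummary(2, 30, 500, 20)])
print(hour)         # ScaleSummary(count=4, total=100, sum_sq=3000, max_val=40)
```

Because counts, sums and sums of squares add and maxima combine by taking the larger value, the collapsed record still yields the exact max and standard deviation over the longer window at the original resolution.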

With such techniques, fine-grain network statistics for
hundreds of links over an entire year could be stored to a
single server. The data could be keyed by link and time
and stored in a relational database to allow queries across
time (is the traffic on a single link becoming more bursty
with time?) or across links (did a number of bursts
correlate on multiple switch input ports?).


Figure 7: Average relative error for DBM with 1000 buckets and EXPB with 11 buckets shown on the “tritonsort”
trace for a 400 µs query interval and various aggregation periods.


Table 3: Average TCP bandwidth reported by iperf over
60 10-second runs. We also show the average time spent
in the kernel-level monitoring functions for each packet
sent. DBM and EXPB were run with a base time scale of
$\Delta$ = 64 µs and $T$ = 1 second aggregation period.


Hardware Implementation Both DBM and EXPB al- 
gorithms can be implemented in hardware for use in 
switches and routers. EXPB has an amortized cost of two 
bucket updates per measurement interval. Since bucket 
updates are only needed at the frequency of the measure- 
ment time scale, these operations could be put on a work 
queue and serviced asynchronously from the main packet 
pipeline. The key complication for implementing DBM 
in hardware is maintaining a binary heap. However, a 
1000 bucket heap can be maintained in hardware using 
a 2-level radix-32 heap that uses 32-way comparators at 
10 Gbps. Higher bucket sizes and speeds will require 
pipelining the heap. The extra hardware overhead for 
these algorithms in gates is minimal. Finally, the logging
overhead is very small, especially when compared
to NetFlow.

14.482 sec

Table 4: The time needed to transfer a 1 GB file over scp.
We measured the cumulative overhead incurred by our
monitoring routines for all send and receive events. We
report this overhead as a percentage of each experiment’s
total running time.


6 Conclusions 


Picking out bursts in a large amount of resource usage 
data is a fundamental problem and applies to all re- 
sources, whether power, cooling, bandwidth, memory, 
CPU, or even financial markets. However, in the do- 
main of data center networks, the increase of network 
speeds beyond 1 Gigabit per second and the decrease of
in-network buffering has made the problem one of great 
interest. 

Managers today have little information about how mi- 
crobursts are caused. In some cases they have identi- 
fied paradigms such as InCast, but managers need better 
visibility into bandwidth usage and the perpetrators of 
microbursts. They would also like better understanding
of the temporal dynamics of such bursts. For instance, 
do they happen occasionally or often? Do bursts linger 
below a tipping point for a long period or do they arise 
suddenly like tsunamis? Further, correlated bursts across 
links lead to packet drops. A database of bandwidth in- 
formation from across an administrative domain would 
be valuable in identifying such patterns. Of course, this 
could be done by logging a record for every packet, but 
this is too expensive to contemplate today. 

Our paper provides the first step to realizing such a vi- 
sion for a cheap network-wide bandwidth usage database 
by showing efficient summarization techniques at links 
(~4 KB per second, for example, for running DBM and 
EXPB on 10 Gbps links) that can feed a database backend 
as shown in Figure 1. Ideally, this can be supplemented 
by algorithms that also identify the flows responsible for 
bursts and techniques to join information across multiple 
links to detect offending applications and their timing. 
Of the two algorithms we introduce, Exponential Bucket- 
ing offers accurate measurement of the average, max and 
standard deviation of bandwidths at arbitrary sampling 
resolutions with very low memory. In contrast, Dynamic 
Bucket Merge approximates a time-series of bandwidth 
measurements that can be visualized or used to compute
advanced statistics, such as quantiles.

While we have shown the application of DBM and 
EXPB to bandwidth measurements in endhosts, these al- 
gorithms could be easily ported to in-network monitor- 
ing devices or switches. Further, these algorithms can be 
generally applied to any time-series data, and will be par- 
ticularly useful in environments where resource spikes 
must be detected at fine time scales but logging throughput
and archival memory are constrained.
Abstract


In this paper, we design, implement, and evaluate a 
new scalable and fault tolerant network manager, called 
ETTM, for securely and efficiently managing network 
resources at a packet granularity. Our aim is to pro- 
vide network administrators a greater degree of control 
over network behavior at lower cost, and network users a 
greater degree of performance, reliability, and flexibility, 
than existing solutions. In our system, network resources 
are managed via software running in trusted execution 
environments on participating end-hosts. Although the 
software is physically running on end-hosts, it is logi- 
cally controlled centrally by the network administrator. 
Our approach leverages the trend to open management 
interfaces on network switches as well as trusted com- 
puting hardware and multicores at end-hosts. We show 
that functionality that seemingly must be implemented 
inside the network, such as network address translation 
and priority allocation of access link bandwidth, can be 
simply and efficiently implemented in our system. 


1 Introduction 


In this paper, we propose, implement, and evaluate a new 
approach to the design of a scalable, fault tolerant net- 
work manager. Our target is enterprise-scale networks 
with common administrative control over most of the 
hardware on the network, but with complex quality of 
service and security requirements. For these networks, 
we provide a uniform administrative and programming 
interface to control network traffic at a packet granular- 
ity, implemented efficiently by exploiting trends in PC 
and network switch hardware design. Our aim is to pro- 
vide network administrators a greater degree of control 
over network behavior at lower cost, and network users a 
greater degree of performance, reliability, and flexibility, 
compared to existing solutions. 

Network management today is a difficult and complex 
endeavor. Although IP, Ethernet and 802.11 are widely 
available standards, most network administrators need 
more control over network behavior than those proto- 
cols provide, in terms of security configuration [21, 14], 
resource isolation and prioritization [36], performance 
and cost optimization [4], mobility support [22], prob- 
lem diagnosis [27], and reconfigurability [7]. While most 
end-host operating systems have interfaces for configuring
certain limited aspects of network security and resource
policy, these configurations are typically set independently
by each user and therefore provide little assurance
or consistent behavior when composed across multiple
users on a network.


Instead, most network administrators turn to middle- 
boxes - a central point of control at the edge of the net- 
work where functionality can be added and enforced on 
all users. Unfortunately, middleboxes are neither a com- 
plete nor a cost-efficient solution. Middleboxes are usu- 
ally specialized appliances designed for a specific pur- 
pose, such as a firewall, packet shaper, or intrusion de- 
tection system, each with their own management inter- 
face and interoperability issues. Middleboxes are typi- 
cally deployed at the edge of the (local area) network, 
providing no help to network administrators attempting 
to control behavior inside the network. Although middle- 
box functionality could conceivably be integrated with 
every network switch, doing so is not feasible at line-rate 
at reasonable cost with today’s LAN switch hardware. 

We propose a more direct approach, to manage net- 
work resources via software running in trusted execution 
environments on participating endpoints. Although the 
software is physically running on endpoints, it is logi- 
cally controlled centrally by the network administrator. 
We somewhat whimsically call our approach ETTM, or 
End to the Middle. Of course, there is still a middle, 
to validate the trusted computing stack running on each 
participating node, and to redirect traffic originating from 
non-participating nodes such as smart phones and print- 
ers to a trusted intermediary on the network. By moving 
packet processing to trusted endpoints, we can enable a 
much wider variety of network management functional- 
ity than is possible with today’s network-based solutions. 

Our approach leverages four separate hardware and 
software trends. First, network switches increasingly 
have the ability to re-route or filter traffic under admin- 
istrator control [7, 30]. This functionality was origi- 
nally added for distributed access control, e.g., to pre- 
vent visitors from connecting to the local file server. We 
use these new-generation switches as a lever to a more 
general, fine-grained network control model, e.g., allow- 
ing us to efficiently interpose trusted network manage- 
ment software on every packet. Second, we observe 
that many end-host computers today are equipped with 
trusted computing hardware, to validate that the endpoint 
is booted with an uncorrupted software stack. This al- 
lows us to use software running on endpoints, and not 
just network hardware in the middle of the network, as 
part of our enforcement mechanism for network man- 
agement. Third, we leverage virtual machines. Our 
network management software runs in a trusted virtual
machine which is logically interposed on each network 
packet by a hypervisor. Despite this, to the user each 
computer looks like a normal, completely configurable 
local PC running a standard operating system. Users can 
have complete administrative control over this OS with- 
out compromising the interposition engine. Finally, the 
rise of multicore architectures means that it is possible 
to interpose trusted packet processing on every incom- 
ing/outgoing packet without a significant performance 
degradation to the rest of the activity on a computer. 

In essence, we advocate converting today’s closed ap- 
pliance model of network management to an open soft- 
ware model with a standard API. None of the function- 
ality we need to implement on top of this API is par- 
ticularly complex. As a motivating example, consider a 
network administrator needing to set up a computer lab 
at a university in a developing country with an underpro- 
visioned, high latency link to the Internet. It is well un- 
derstood that standard TCP performance will be dread- 
ful unless steps are taken to manipulate TCP windows 
to limit the rate of incoming traffic to the bandwidth of 
the access link, to cache repeated content locally, and to 
prioritize interactive traffic over large background trans- 
fers. As another example, consider an enterprise seek- 
ing to detect and combat worm traffic inside their net- 
work. Current Deep Packet Inspection (DPI) techniques 
can detect worms given appropriate visibility, but are ex- 
pensive to deploy pervasively and at scale. We show that 
it is possible to solve these issues in software, efficiently, 
scalably, and with high fault tolerance, avoiding the need 
for expensive and proprietary hardware solutions. 

The rest of the paper discusses these issues in more 
detail. We describe our design in § 2, sketch the network 
services which we have built in § 3, summarize related 
work in § 4 and conclude in § 5.


2 Design & Prototype 


ETTM is a scalable and fault-tolerant system designed to 
provide a reliable, trustworthy and standardized software 
platform on which to build network management services
without the need for specialized hardware. However,
this approach raises several questions concerning security,
reliability and extensibility.

• How can network management tasks be entrusted to
commodity end hosts which are notorious for being
insecure? In our model, network management tasks
can be relocated to any trusted execution environment
on the network. This requires that the network management
software be verified and isolated from the host
OS to be protected from compromise.

• If the management tasks are decentralized, how can
these distributed points of control provide consistent
decisions which survive failures and disconnections?
The system should not break simply because a user,
or a whole team of users, turn off their computers.
In particular, management services must be available
in the face of node failures and maintain consistent state
regarding the resources they manage.

• How can we architect an extensible system that enables
the deployment of new network management
services which can interpose on relevant packets?
Network administrators need a single interface to install,
configure and compose new network management
services. Further, the implementation of the interface
should not impose undue overheads on network
traffic.

Figure 1: The architecture of an ETTM end-host. Network
management services run in a trusted virtual machine (AEE).
Application flows are routed to appropriate network management
services using a micro virtual router (µvrouter).


While many of the techniques we employ to surmount 
these challenges are well-known, their combination into 
a unified platform able to support a diverse set of net- 
work services is novel. The particular mechanisms we 
employ are summarized in Table 1, and the architecture 
of a given end-host participating in management can be 
seen in Figure 1. 


The function of these mechanisms is perhaps best il- 
lustrated by example, so let us consider a distributed Net- 
work Address Translation (NAT) service for sharing a 
single IP address among a set of hosts. The NAT service 
in ETTM maps globally visible port numbers to private 
IP addresses and vice versa. First, the translation table 
itself needs to be consistent and survive faults, so it is 
maintained and modified consistently by the consensus 
subsystem based on the Paxos distributed coordination 
algorithm. Second, the translator must be able to inter- 
pose on all traffic that is either entering or leaving the 
NATed network. The micro virtual router (µvrouter)’s
filters allow for this interposition on packets sourced by an
ETTM end-host, while the physical switches are set up to
Trusted Authorization | Extension to the 802.1X network access control protocol to authorize trusted stacks
Attested Execution | Trusted space to run filters and control plane processes on untrusted end-hosts
Physical Switches | In-network enforcers of access control and routing/switching policy decisions
Filters | End-host enforcers of network policy running inside the attested execution environment | Multicore
Consensus | Agreement on management decisions and shared state | Fault tolerance | Reliability, 25


Table 1: Summary of mechanisms in ETTM.


deliver incoming packets to the appropriate host.¹ Lastly,
because potentially untrusted hosts will be involved in 
the processing of each packet, the service is run only in 
an isolated attested execution environment on hosts that 
have been verified using our trusted authorization proto- 
col based on commodity trusted hardware. 


2.1 Trusted Authorization 


Traditionally, end-hosts running commodity operating 
systems have been considered too insecure to be en- 
trusted with the management of network resources. 
However, the recent proliferation of trusted computing 
hardware has opened the possibility of restructuring the 
placement of trust. In particular, using the trusted plat- 
form module (TPM) [39] shipping with many current 
computers, it is possible to verify that a remote com- 
puter booted a particular software stack. In ETTM, we 
use this feature to build an extension to the widely-used 
802.1X network access control protocol to make authorization
decisions based on the booted software stack of
end-hosts rather than using traditional key- or password- 
based techniques. We note that the guarantees provided 
by trusted computing hardware generally assume that an 
attacker will not physically tamper with the host, and we 
make this assumption as well. 

The remainder of this section describes the particular 
capabilities of current trusted hardware and how they en- 
able the remote verification of a given software stack. 


2.1.1 Trusted Platform Module 


The TPM is a hardware chip commonly found on moth- 
erboards today consisting of a cryptographic processor, 
some persistent memory, and some volatile memory. The 
TPM has a wide variety of capabilities including the se- 
cure storage of integrity measurements, RSA key cre- 
ation and storage, RSA encryption and decryption of 
data, pseudo-random number generation and attestation 
to portions of the TPM state. Much of this functionality 


¹This is possible with legacy Ethernet switches using a form of detour
routing or more efficiently with programmable switches [30].


Extension to the 802.1X network access control pro- 





Virtualization, Scalability 22 
Multicore 
Extensibility 


is orthogonal to the purposes of this paper. Instead, we 
focus on the features required to remotely verify that a 
machine has booted a given software stack. 


One of the keys stored in the TPM’s persistent memory 
is the endorsement key (EK). The EK serves as an iden- 
tity for the particular TPM and is immutable. Ideally, the 
EK also comes with a certificate from the manufacturer 
stating that the EK belongs to a valid hardware TPM. 
However, many TPMs do not ship with an EK from the
manufacturer. Instead, the EK is set as part of initializing
the TPM for its first use. 


The volatile memory inside the TPM is reset on ev- 
ery boot. It is used to store measurement data as well 
as any currently loaded keys. Integrity measurements of 
the various parts of the software stack are stored in regis- 
ters called Platform Configuration Registers (PCRs). All 
PCR values start as 0 at boot and can only be changed 
by an extend operation, i.e., it is not possible to replace
the value stored in the PCR with an arbitrary new value. 
Instead, the extend operation takes the old value of the 
PCR register, concatenates it with a new value, computes 
their hash using Secure Hash Algorithm 1 (SHA-1), and 
replaces the current value in the PCR with the output of 
the hash operation. 
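
The extend operation described above can be sketched in a few lines. This is a minimal illustration using Python's standard `hashlib`, not TPM driver code; the helper names `extend` and `replay` are hypothetical, and real TPM 1.2 hardware expects each extended value to be a 20-byte digest.

```python
import hashlib

PCR_SIZE = 20  # SHA-1 digest length in bytes


def extend(pcr: bytes, value: bytes) -> bytes:
    """Return SHA-1(old PCR || new value), the only way a PCR can change."""
    return hashlib.sha1(pcr + value).digest()


def replay(measurements) -> bytes:
    """Fold a sequence of measurements into a PCR that starts at zero."""
    pcr = bytes(PCR_SIZE)  # all PCR values start as 0 at boot
    for value in measurements:
        pcr = extend(pcr, value)
    return pcr
```

Because each new value is hashed together with the previous register contents, the final PCR depends on every measurement and on their order, so replaying the same measurements in a different order yields a different register value.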


2.1.2 Trusted Boot 


The intent is that as the system boots, each software com- 
ponent will be hashed and its hash will be used to ex- 
tend at least one of the PCRs. Thus, after booting, the 
PCRs will provide a tamper evident summary of what 
happened during the boot. For instance, the post-boot 
PCR values can be compared against ones corresponding 
to a known-good boot to establish if a certain software 
stack has been loaded or not. 


To properly measure all of the relevant components 
in the software stack requires that each layer be instru- 
mented to measure the integrity of the next layer, and 
then store that measurement in a PCR before passing ex- 
ecution on. Storing measurements of different compo- 
nents into different PCRs allows individual modules to 
be replaced independently.

As each measurement’s validity depends on the cor- 
rectness of the measuring component, the PCRs form a 
chain of trust that must be rooted somewhere. This root 
is the immutable boot block code in the BIOS and is 
referred to as the Core Root of Trust for Measurement 
(CRTM). The CRTM measures itself as well as the rest 
of BIOS and appends the value into a PCR before pass- 
ing control to any software or firmware. This means that 
any changeable code will not acquire a blank PCR state 
and cannot forge being the “bottom” of the stack. 

It should be noted that the values in the PCRs are only 
representative of the state of the machine at boot time. 
If malicious software is loaded or changes are made to 
the system thereafter, the changes will not be reflected 
in the PCRs until the machine is rebooted. Thus, it is 
important that only minimal software layers are attested. 
In our case, we attest the BIOS, boot loader, virtual ma- 
chine monitor, and execution environment for network 
services. We do not need to attest the guest OS running 
on the device, as it is never given access to the raw pack- 
ets traversing the device. 


2.1.3 Attestation 


Once a machine is booted with PCR values in the TPM, 
we need a verifiable way to extract them from the TPM 
so that a remote third party can verify that they match 
a known-good software stack and that they came from 
a real TPM. In theory this should be as simple as sign- 
ing the current PCR values with the private half of the 
EK, but signing data with the EK directly is disallowed.²
Instead, Attestation Identity Keys (AIKs) are created to 
sign data and create attestations. The AIKs can be as- 
sociated with the TPM’s EK either via a Privacy CA or 
via Direct Anonymous Attestation [39] in order to prove 
that the AIKs belong to a real TPM. As a detail, because 
many TPMs do not ship with EKs from their manufactur- 
ers, these computers must generate an AIK at installation 
and store the public half in a persistent database. 

To facilitate attestation, TPMs provide a quote opera- 
tion which takes a nonce and signs a digest of the current 
PCRs and that nonce with a given AIK. Thus, a verifier 
can challenge a TPM-equipped computer with a random, 
fresh nonce and validate that the response comes from a 
known-good AIK, contains the fresh nonce, and repre- 
sents a known-good software stack. 
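
The challenge-response check can be sketched as below. This is an illustration under stated assumptions, not the TPM interface: a real quote is an RSA signature made with the AIK's private half, whereas this sketch substitutes an HMAC keyed with a shared secret so that it runs with only the standard library. The function names and the `known_aiks` table are hypothetical.

```python
import hashlib
import hmac
import os


def pcr_digest(pcrs) -> bytes:
    """Digest of the current PCR values, in register order."""
    h = hashlib.sha1()
    for value in pcrs:
        h.update(value)
    return h.digest()


def quote(aik_key: bytes, pcrs, nonce: bytes) -> bytes:
    """TPM side: sign (PCR digest || nonce).

    Assumption: HMAC stands in for the AIK's RSA signature.
    """
    return hmac.new(aik_key, pcr_digest(pcrs) + nonce, hashlib.sha1).digest()


def verify_quote(known_aiks, aik_id, known_good_pcrs, nonce, signature) -> bool:
    """Verifier side: accept only a known AIK, the fresh nonce and good PCRs."""
    key = known_aiks.get(aik_id)
    if key is None:
        return False  # response does not come from a known-good AIK
    expected = hmac.new(
        key, pcr_digest(known_good_pcrs) + nonce, hashlib.sha1
    ).digest()
    return hmac.compare_digest(expected, signature)


def fresh_nonce() -> bytes:
    """A random challenge; reusing a nonce would allow replayed quotes."""
    return os.urandom(20)
```

Binding the nonce into the signed data is what prevents a host from replaying a quote recorded during an earlier, known-good boot.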


2.1.4 ETTM Boot 


When a machine attempts to connect to an ETTM net- 
work, the switch forwards the packets to a verification 
server which can be either an already-booted end-host 
running ETTM, a persistent server on the LAN or even 


²This is to avoid creating an immutable identity which is revealed
in every interaction involving the TPM.
Figure 2: The steps required for an ETTM boot and trusted
authorization.
a cloud service.³ On recognizing the connection of a
new host, the switch establishes a tunnel to the verification
server and maintains this tunnel until the verification
server can reach a verdict about authorization. 

If the host is verified as running a complete, trusted 
software stack then it is simply granted access to the 
network. If the host is running either an incomplete or 
old software stack, the ETTM software on the end-host 
attempts to download a fresh copy and retries. Traffic
from non-conformant hosts is tunneled to a participating
host; our design assumes this is a rare case.

Our trusted authorization protocol creates this ex- 
change via an extension to the 802.1X and EAP proto- 
cols. We have extended the wpa_supplicant [28] 
802.1X client and the FreeRADIUS [16] 802.1X server 
to support this extension and provide authorization to 
clients based purely on their attested software stacks. 

This process is shown in Figure 2. First, the end- 
host connects to an ETTM switch, receives an EAP Re- 
quest Identity packet (1), and responds with an EAP 
Response/Identity frame containing the desired AIK to 
use (2). The switch encapsulates this response inside 
an 802.1X packet which is forwarded to the verification
server running our modified version of FreeRADIUS (3).
The FreeRADIUS server responds with a second
EAP Request Trusted Software Stack frame containing
a nonce, again encapsulated inside an 802.1X packet
(4), and the end-host responds with an EAP Response 
Trusted Software Stack frame containing the signed PCR 
values proving the booted software stack (5). This con- 
cludes the verification stage. 

The verification server can then either render a verdict 
as to whether access is granted (7) or require the end-host 
to go through a provisioning stage (6) where extra code 
and/or configuration can be loaded onto the machine and 
the authorization retried. 
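
The verdict and provisioning steps can be sketched as a small retry loop. This is an illustration only: the callbacks `read_pcrs` and `provision`, the `max_attempts` bound and the equality test against a single known-good PCR list are all assumptions made for the sketch and do not come from the modified FreeRADIUS server.

```python
from enum import Enum


class Verdict(Enum):
    GRANT = "grant"          # step (7): access is granted
    PROVISION = "provision"  # step (6): load code or configuration, then retry


def decide(attested_pcrs, known_good_pcrs) -> Verdict:
    """Grant access only when the attested stack matches the known-good one."""
    if attested_pcrs == known_good_pcrs:
        return Verdict.GRANT
    return Verdict.PROVISION


def authorize(read_pcrs, provision, known_good_pcrs, max_attempts=2) -> bool:
    """Verify the host, provisioning and retrying between failed attempts."""
    for attempt in range(max_attempts):
        if decide(read_pcrs(), known_good_pcrs) is Verdict.GRANT:
            return True
        if attempt + 1 < max_attempts:
            provision()  # hypothetical hook that installs a fresh stack
    return False  # caller would treat the host as non-conformant
```

A host for which `authorize` returns `False` corresponds to the non-conformant case, whose traffic is tunneled to a participating host.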


2.1.5 Performance of ETTM Boot 


Table 2 presents microbenchmarks for various TPM op- 
erations (including those which will be described later in 
this section) on our Dell Latitude e5400 with a Broadcom
TPM complying with version 1.2 of the TPM spec, an
Intel 2 GHz Core 2 Duo processor and 2 GB of RAM.


³We assume the existence of some persistently reachable computer
to bootstrap new nodes and store TPM configuration state. Under normal
operation, this is a currently active verified ETTM node.


USENIX Association 




Std. Dev. (s)
PCR Extend | 0.0253 0.001
Create AIK
Load AIK 0.002
Sign PCR | 0.998 0.001


Table 2: The time (in seconds) it takes for a variety of TPM
operations to complete.


Wall Clock Time (s)
client start
receive first server message
receive challenge nonce
send signed PCRs
receive server decision


Table 3: The time (in seconds) it takes for an 802.1X EAP-TSS
authorization with breakdown by operation.


The time to create the AIK is incurred only once, at system
initialization. The total time added to the normal
boot sequence for an ETTM-enabled host is negligible
as most actions can be trivially overlapped with other
boot tasks. Assuming the challenge nonce is received,
the signing time can be overlapped with the booting of
the guest OS as no attestation of its state is required.

Table 3 shows a breakdown of how long each step 
takes in our implementation of trusted authorization as- 
suming an up-to-date trusted software stack is already in- 
stalled on the end-host and the relevant AIK has already 
been loaded. The total time to verify the boot status is 
just over 1 second. This is dominated by the time that 
it takes to sign the PCR values after having received the 
challenge nonce. 


2.2 Attested Execution Environment 


In ETTM, we require that each participating host has a 
corresponding trusted virtual machine which is responsi- 
ble for managing that host’s traffic. We call this virtual 
machine an Attested Execution Environment (AEE) be- 
cause it has been attested by Trusted Authorization. In 
the common case, this virtual machine will run alongside 
the commodity OS on the host, but in some cases a host’s 
corresponding AEE may run elsewhere with the physical 
switching infrastructure providing a constrained tunnel
between the host and its remote VM. 

The AEE is the vessel in which network management 
activities take place on end-hosts. It provides three key 
features: a mechanism to intercept all incoming and out- 
going traffic, a secure and isolated execution environ- 
ment for network management tasks and a common plat- 
form for network management applications. 

To interpose the AEE on all network traffic, the hypervisor
(our implementation makes use of Xen [3]) is configured
to forward all incoming and outgoing network
traffic through the AEE. This configuration is verified as 
part of trusted authorization. Once the AEE has been in- 
terposed on all traffic, it can apply the ETTM filters (de- 
scribed in § 2.4) giving each network service the required 
points of visibility and control of the data path. 

Further, the hypervisor is configured to isolate the 
AEE from any other virtual machines it hosts. Thus, the 
AEE will be able to faithfully execute the prescribed filters
regardless of the configuration of the commodity operating
system.⁴ The AEE can also execute network
management tasks which are not directly related to the 
host’s traffic. For example, it could redirect traffic to a 
mobile host, verify a new host’s software stack or recon- 
figure physical switches. It is even possible for a host to 
run multiple AEEs simultaneously with some being run 
on behalf of other nodes in the system. A desktop with 
excess processing power can stand in to filter the traffic
from a mobile phone. 

Lastly, the AEE provides a common platform to build 
network management services. Because this platform 
is run as a VM, it can remain constant across all end-hosts,
providing a standardized software API. Our current
AEE implementation is a stripped-down Linux virtual
machine; however, we have augmented it with APIs
to manage filters (described in § 2.4) as well as to manage 
reliable, consistent, distributed state (described in § 2.5). 

While in most cases, the added computational re- 
sources required to run an AEE do not pose a problem, 
ETTM allows for AEEs (or some parts of an AEE) to 
be offloaded to another computer. In our prototype, this 
is handled by applications themselves. In the future, we 
hope to add dynamic offloading based on machine load. 


2.3 Physical Switches 


Physical switches are the lowest-level building block in 
ETTM. Their primary purpose is to provide control and 
visibility into the link layer of the network. This includes 
access control, flexible control of packet forwarding, and 
link layer topology monitoring. 


• Authorization/Access Control: As described earlier,
switches redirect and tunnel traffic from as of yet
unauthorized hosts until an authorization decision has
been made.

• Flexible Packet Forwarding: The ability to install
custom forwarding rules in the network enables significantly
more efficient implementations of some
network management services (e.g., NAT), but is not
required. Flexible forwarding also enables more efficient
routing by not constraining traffic within the
traditional Ethernet spanning tree protocol.

• Topology Monitoring: In order to properly manage
available network resources, end-hosts must be able
to discover what network resources exist. This includes
the set of physical switches and links along
with the links’ latencies and capacities.


⁴We make use of a VM other than the root VM (e.g., Dom0 in Xen)
for the AEE to both maintain independence from any particular hypervisor
and to protect any such root VM from misbehaving applications
in the AEE.


NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation


At a minimum, ETTM only requires the first of these 
capabilities and since we implement access control via 
an extension to 802.1X and EAP, most current ethernet 
switches (even many inexpensive home routers [31, 10]) 
can serve as ETTM switches. There are advantages 
to more full-featured switches, however. For instance, 
a physical switch that supports the 802.1AE MACSec 
specification can provide a secure mechanism to differ- 
entiate between the different hosts attached to the same 
physical port and authorize them independently, while 
denying access to other unauthorized hosts attached to 
the port. 

Additionally, ETTM can better manage network re- 
sources when used in conjunction with an OpenFlow 
switch [30]. OpenFlow provides a wealth of network 
status information and supports packet header rewriting 
and flexible, rule-based packet forwarding. We currently 
leave interacting with programmable switches to applica- 
tions. Many applications function correctly using simple 
Ethernet spanning tree routing and do not require con- 
trol over packet-forwarding. Those that do, like the NAT, 
must either implement packet redirection in the applica- 
tion logic by having AEEs forward packets to the ap- 
propriate host or manage configuring the programmable 
switches themselves. We are in the process of creating a 
standard interface to packet forwarding in ETTM. 


2.4 


On each end-host, we construct a lightweight virtual 
router, called the micro virtual router (jvvrouter), which 
mediates access to incoming and outgoing packets by the 
various services. Services use the pvrouter to inspect 
and modify packets as well as insert new packets or drop 
packets. The core idea of filters in ETTM is that they 
are the mechanism to interpose on a per-packet basis and 
their behavior can be controlled by consensus operations 
which occur at a slower time scale: one operation per 
flow or one operation per flow, per RTT. 

The jvrouter consists of an ordered list (by priority) 
of filters which are applied to packets as they depart and 
arrive at the host. The current Filter API is described 
in Table 4. The filters which we have implemented so 
far (described in 8 3) correspond to tasks that would cur- 
rently be carried out by a special-purpose middlebox like 
a NAT, web cache, or traffic shaper. 

The µvrouter is approximately 2250 lines of C++ code
running on Linux using libipq and iptables to capture
traffic. This has simplified development by allowing


Micro Virtual Router 
matchOnHeader()
    returns true if the filter can match purely on IP, TCP and
    UDP headers (i.e., without considering the payload) and
    false if the filter must match on full packets

getPriority()
    returns the priority of the filter; this is used to establish
    the order in which filters are applied

getName()
    simply returns a human-readable name of the filter

matchHeader(iphdr, tcphdr, udphdr)
    returns true if the filter is interested in a packet with these
    headers; undefined filters are set to NULL and behavior is
    undefined if matchOnHeader() returns false

match(packet)
    returns true if the filter is interested in the packet; behavior
    is undefined if matchOnHeader() returns true

filter(packet)
    actually processes a packet; returns one of ERROR,
    CONTINUE, SEND, DROP or QUEUED and possibly modifies
    the packet

upkeep()
    this function is called ‘frequently’ and enables the filter to
    perform any maintenance that is required

getReadyPackets()
    this returns a list of packets that the filter would like to either
    dequeue or introduce; this is called ‘frequently’

Table 4: The filter API.


the µvrouter to run as a user-space application. However,
the user-space implementation has a downside in that it
imposes performance overheads that limit the sustained
throughput for large flows. To address the performance
concerns, we split the functionality of the µvrouter into
two components—a user-space module supporting the 
full filter API specified in Table 4 and a kernel-level 
module that supports a more restricted API used only for 
header rewriting and rate-limiting. In applications such 
as the NAT, the user-space filter is invoked only for the 
first packet in order to assign a globally unique port num- 
ber to the flow, while the kernel module is used for filling 
in this port number in subsequent packets. 


The µvrouter enables an administrator to specify a
stack of filters that carry out the data-plane management 
tasks for the network. That is, it handles traffic that is 
destined for or emanates from an end-host on the net- 
work. Traffic destined to or emanating from AEEs or 
physical switches constitutes the control plane of ETTM 
and is not handled by the filters. 
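
As a rough illustration of how a priority-ordered filter list could be driven, the following Python sketch mirrors part of the API in Table 4. It is not the µvrouter's C++ code. The names `Filter`, `DropPort`, and `run_chain`, the dictionary packet format, the rule that a lower priority number runs first, and the default SEND verdict are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    ERROR = 0
    CONTINUE = 1
    SEND = 2
    DROP = 3
    QUEUED = 4


class Filter:
    """Hypothetical base class mirroring part of the Table 4 API."""
    name = "filter"
    priority = 0

    def match_on_header(self):
        return True          # this filter only needs IP/TCP/UDP headers

    def match_header(self, headers):
        return False

    def match(self, packet):
        return False

    def filter(self, packet):
        return Verdict.CONTINUE


class DropPort(Filter):
    """Toy filter: drops packets addressed to one destination port."""

    def __init__(self, port, priority):
        self.port = port
        self.priority = priority
        self.name = f"drop-{port}"

    def match_header(self, headers):
        return headers.get("dport") == self.port

    def filter(self, packet):
        return Verdict.DROP


def run_chain(filters, packet):
    """Apply filters in priority order (assumed: lower number first)."""
    for f in sorted(filters, key=lambda f: f.priority):
        if f.match_on_header():
            interested = f.match_header(packet["headers"])
        else:
            interested = f.match(packet)
        if not interested:
            continue
        verdict = f.filter(packet)
        if verdict is not Verdict.CONTINUE:
            return verdict   # any verdict other than CONTINUE ends the chain
    return Verdict.SEND      # assumed default when no filter objects
```

A call such as `run_chain([DropPort(23, priority=1)], {"headers": {"dport": 23}})` returns `Verdict.DROP`, while a packet to any other port falls through to the default verdict.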


2.5 Consensus 


If network management is going to be distributed among 
a large number of potentially unreliable commodity com- 
puters, there must be a layer to provide consistency and 
reliability despite failures. For example, a desktop unex- 
pectedly being unplugged should not cause any state to 
be lost for the remaining functioning computers. Fortunately,
there is a vast literature on how to build reliable
systems out of inexpensive, unreliable parts. In our case 
we build reliability using the Paxos algorithm for dis- 
tributed consensus [25]. 


We expose a common API which provides a simple 
way for ETTM network services to manage their consis- 
tent state including the ability to define custom rules for 
what state should be semantically allowed and ways to 
choose between liveness and safety in the event that it is 
required. We expose our consensus implementation via 
a table abstraction in which each row corresponds to a 
single service’s state and each cell in a given row corre- 
sponds to an agreed upon action on the state managed by 
the service. Thus, each service has its own independently 
ordered list of agreed upon values, with each row entirely 
independent of other rows from the point of view of the 
Paxos implementation. 


In building the API and its supporting implementation 
we strove to overcome several key challenges: 


Application Independent Agreement: The actual 
agreement process should be entirely independent of the 
particular application. As a consequence, the abstrac- 
tion presented is agreement on an ordered list of blobs of 
bytes for each application or service, with the following 
operations allowed on this ordered list. 


• put(name, value): Attempts to place value
as a cell in the row named name. This will not return
immediately specifying success or failure, but if the
value is accepted, a later get call or subscription will
return value.


• get(name, seqNum): Attempts to retrieve cell
number seqNum from the row named name. Returns
an error if seqNum is invalid and the relevant
value otherwise.


For example, our NAT implementation creates a row in 
the table called “NAT”. When an outgoing connection is 
made an entry is added with the mapping from the private 
IP address and port to the public IP address and a glob- 
ally visible port along with an expiration time. Nodes 
with long-running connections can refresh by appending 
a new entry. Thus, each node participating in the NAT 
can determine the shared state by iteratively processing 
cells from any of the replicas. 
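
To make the table abstraction concrete, the sketch below models a row as an append-only list and rebuilds NAT state by replaying its cells. It is a single-process stand-in with no Paxos behind it. The class `ConsensusTable`, the function `nat_state`, the dictionary layout of a NAT entry, and zero-based sequence numbers are assumptions for illustration only.

```python
class ConsensusTable:
    """Single-process stand-in for the replicated table (no Paxos here)."""

    def __init__(self):
        self.rows = {}

    def put(self, name, value):
        # In ETTM a value is appended only once it is agreed upon; this
        # sketch assumes agreement always succeeds immediately.
        self.rows.setdefault(name, []).append(value)

    def get(self, name, seq_num):
        row = self.rows.get(name, [])
        if not 0 <= seq_num < len(row):
            raise IndexError("invalid seqNum")
        return row[seq_num]

    def highest_sequence_number(self, name):
        return len(self.rows.get(name, [])) - 1


def nat_state(table, now):
    """Replay the "NAT" row to rebuild the map of live external ports."""
    mappings = {}
    for seq in range(table.highest_sequence_number("NAT") + 1):
        entry = table.get("NAT", seq)
        mappings[entry["ext_port"]] = entry   # later cells refresh earlier ones
    # Drop mappings whose lease has expired.
    return {p: e for p, e in mappings.items() if e["expires"] > now}
```

Because every node replays the same ordered cells, each one arrives at the same mapping table without any further coordination.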


Publish-Subscribe Functionality: A network service 
can subscribe to the set of agreed upon values for a row 
via the subscribe API call. The service running on an 
ETTM node receives a callback (using notify) when 
new values are added to a given row through the put 
API calls. This is useful not just for letting services 
manage their own state, but also for subscribing to spe- 
cial rows that contain information about the network in 


general. For instance, there is one row which describes 
topology information and another row which logs
authorization decisions. The publish-subscribe interface
consists of the following calls:


• subscribe(name, seqNum): Asks that the
values of all cells in the row name starting with the
cell numbered seqNum be sent to the caller. This
includes all cells agreed on in the future.


• unsubscribe(name): Cancels any existing
subscription to the row name.


• notify(name, value, seqNum): This is the
callback from a subscription call and lets the
client know that cell number seqNum of row name
has the value value.
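
The subscription calls above can be sketched as follows. This is an in-memory illustration only: `PubSubTable` is a hypothetical name, agreement is assumed to succeed on every put, and `unsubscribe` takes the callback as an extra argument because the sketch has no notion of caller identity.

```python
from collections import defaultdict


class PubSubTable:
    """In-memory illustration of subscribe / unsubscribe / notify."""

    def __init__(self):
        self.cells = defaultdict(list)        # row name -> agreed values
        self.subscribers = defaultdict(list)  # row name -> notify callbacks

    def subscribe(self, name, seq_num, notify):
        # Replay cells already agreed on from seq_num onward, then
        # register the callback for cells agreed on in the future.
        for i, value in enumerate(self.cells[name][seq_num:], start=seq_num):
            notify(name, value, i)
        self.subscribers[name].append(notify)

    def unsubscribe(self, name, notify):
        self.subscribers[name].remove(notify)

    def put(self, name, value):
        self.cells[name].append(value)
        seq = len(self.cells[name]) - 1
        for notify in list(self.subscribers[name]):
            notify(name, value, seq)
```

A service that subscribes with `seqNum` set to the first cell it has not yet processed therefore sees each agreed value exactly once, in order.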


Balance Reliability and Performance: Invariably 
adding more nodes and thus increasing expected relia- 
bility causes performance to degrade as more responses 
are required. Thus, we allow for a subset of the partici- 
pating ETTM nodes to form the Paxos group rather than 
the whole set. ETTM nodes use the following API calls 
to join and depart from consensus groups and to identify 
the set of cells that have been agreed upon by the con- 
sensus group. 


• join(name): Asks the local consensus agent to
participate in the row name.


• leave(name): Asks the local consensus agent to
stop participating in row name. A graceful ETTM
machine shutdown involves informing each row that
the node is leaving beforehand.


• highestSequenceNumber(name): Returns the
current highest valid cell number in the row named
name.


Allow Application Semantics: While we wish to be ap- 
plication agnostic in the details of agreement, we also 
would like services to be able to enforce some seman- 
tics about what constitute valid and invalid sequences of 
values. Coming back to the NAT example, the seman- 
tic check can ensure that a newly proposed IP-port map- 
ping does not conflict with any previously established 
ones and can even deal with the leased nature of our IP- 
port mappings making the decision once (typically at the 
leader of the Paxos group) as to whether the old lease 
has expired or not. We accomplish this by having net- 
work services optionally provide a function to check the 
validity of each value before it is proposed.


• setSemanticCheckPolicy(name, policyhandler): Sets the
semantic check policy for row name. policyhandler is an
application-specific callback function that is used to
check the validity of the proposed values.


• check(policyhandler, name, value, seqNum): Asks the
consensus client if value is a semantically valid value
to be put in cell number seqNum of row name. Returns true
if the value is semantically valid, false if it is not,
and an error if the checker has not been informed of all
cells preceding cell number seqNum.


Finally, each row maintained by the consensus sys- 
tem can have a different set of policies about whether 
to check for semantic validity, whether to favor safety or 
liveness (as described below), and even which nodes are 
serving as the set of replicas. 
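
A semantic check for the NAT row might look like the sketch below. It is a guess at one reasonable policy, not the handler used in ETTM: the entry fields (`private`, `ext_port`, `expires`, `proposed_at`) are hypothetical, and `check` here receives the list of preceding cells directly rather than a row name.

```python
def nat_policy_handler(previous_cells, value):
    """Return True if `value` may be appended after `previous_cells`.

    A proposal is rejected when another host still holds a live lease
    on the same external port at the time of the proposal.
    """
    for cell in previous_cells:
        same_port = cell["ext_port"] == value["ext_port"]
        same_owner = cell["private"] == value["private"]
        lease_live = cell["expires"] > value["proposed_at"]
        if same_port and lease_live and not same_owner:
            return False
    return True


def check(policy_handler, row, value, seq_num):
    """Mimics check(): an error unless all preceding cells are known."""
    if seq_num != len(row):
        raise ValueError("checker has not seen all cells before seqNum")
    return policy_handler(row, value)
```

Running the check once, at the leader, means the decision about whether an old lease has expired is made against a single clock rather than separately at every replica.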


2.5.1 Catastrophic Failures 


Paxos can make progress only when a majority of the 
nodes are online. If membership changes gradually, the 
Paxos group can elect to modify its membership. The 
two critical parameters that determine the robustness of 
the quorum are the churn rate and the time it takes to 
detect failure and change the group’s membership. The 
consensus group can continue to operate if fewer than 
half of the nodes fail before their failure is detected. In 
such cases, since a majority of the machines in the con- 
sensus group are still operating, we have that set vote on 
any changes necessary to cope with the churn [26]. 

But if a large number of nodes leave simultaneously 
(e.g., because of a power outage), we allow services to 
opt to make progress despite inconsistent state. Each 
service can pick how it wants to handle this case for its
row, deciding to either favor liveness or safety via the 
setForkPolicy call. If the row favors safety, then 
the row is effectively frozen until a time when a majority 
of the nodes recover and can continue to make progress. 
However, we allow for a row to favor liveness, in which 
case the surviving nodes make note of the fact that they 
are potentially breaking safety and fork the row. 

Forking effectively creates a new row in which the first 
value is an annotation specifying the row from which it 
was forked off, the last agreed upon sequence number 
before the fork and the new set of nodes which are be- 
lieved to be up. This enables a minority of the nodes to 
continue to make progress. Later on, when a majority of 
the nodes in the original row return to being up, it is up to 
the service to merge the relevant changes (and deal with 
any potential conflicts) from the forked row back into the 
main row via the normal put operation and eventually 
garbage collect the forked row via a delete operation. 
The details of this API are described in Table 5. 

While, in theory, building services that can handle po- 
tentially inconsistent state is hard, we have found that, in 
practice, many services admit reasonable solutions. For 
instance, a NAT which experiences a catastrophic fail- 
ure can continue to operate and when merging conflicts 
it may have to terminate connections if they share the 
same external IP and port, though most of the time there 
will be no such conflicts. 
setForkPolicy(name, policy)
    Sets the forking policy for the row name in the case of
    catastrophic failures. The valid values of policy are ‘safe’ and
    ‘live’.

delete(name)
    Cleans up the state associated with row name. Fails if called
    on a row which is not a fork of an already existing row.

forkNotify(name, forkName)
    Informs the consensus client that because the client asked to
    favor liveness over safety, the row name has been forked and
    that a new copy has been started as row forkName where
    potentially unsafe progress can be made, but may need to be
    later merged.

Table 5: API for dealing with catastrophic failures.
Figure 3: The average time for a Paxos round to complete with
and without a leader as we vary the size of the Paxos group. 


2.5.2 Implementation 


Our current implementation of consensus is approxi- 
mately 2100 lines of C++ code implementing a straight- 
forward and largely unoptimized adaptation of the Paxos 
distributed agreement protocol. In Paxos, proposals are 
sent to all participating nodes and accepted if a majority 
of the nodes agree on the proposal. In our implemen- 
tation, one leader is elected per row and all requests for 
that row are forwarded to the leader. If progress stalls, the 
leader is assumed to have failed and a new one is elected 
without concern for contention. If progress on electing 
a leader stalls, then the row can be unsafely forked de- 
pending on the requested forking policy. As nodes fail, 
the Paxos group reconfigures itself to remove the failed 
node from the node set and replace it with a different 
ETTM end-host. 


Figure 3 shows the average time for a round of our 
Paxos implementation to complete when running with 
varying numbers of pc3000 nodes (with 3GHz, 64-bit 
Xeon processors) on Emulab [15]. The results show that 
a Paxos round can be completed within 2 ms when there 
is no leader and within 1 ms with a leader. While the
computation necessarily grows linearly with the number 
of nodes, this effect is mitigated by running Paxos on a 
subset of the active ETTM nodes. For example, as we 
Figure 4: Bandwidth throughput of flows traversing ETTM
NAT as we vary the flow size. 


will show in our evaluation of the NAT, a Paxos group 
of only 10 nodes—with new machines brought in only to 
replace any departing nodes in the subset—provides suf- 
ficient throughput and availability for the management of 
a large number of network flows. 


3 Network Management Services 


We next describe the design, implementation, and eval- 
uation of several example services we have built using 
ETTM. These services are intended to be proof of con- 
cept examples of the power of making network admin- 
istration a software engineering, rather than a hardware 
configuration, problem. In each case the functionality 
we describe can also be implemented using middleboxes. 
However, a centralized hardware solution increases costs 
and limits reliability, scalability, and flexibility. Propos- 
als exist to implement several of these services as peer- 
to-peer applications on end-hosts [23, 38], but this raises 
questions of enforcement and privacy. Instead, ETTM 
provides the best of both worlds: safe enforcement of 
network management without the limitations of hard- 
ware solutions. 


3.1 NATs 


Network Address Translators (NATs) share a single 
externally-visible IP address among a number of differ- 
ent hosts by maintaining a mapping between externally 
visible TCP or UDP ports and the private, internally- 
visible IP addresses belonging to the hosts. Mappings 
are generated on-demand for each new outgoing connec- 
tion, stored and transparently applied at the NAT device 
itself. Traffic entering the network which does not
belong to an already-established mapping is dropped.
As a result, passive listeners such as servers and peer-to- 
peer systems can have connectivity problems when lo- 
cated behind NATs. Mappings are usually not replicated, 
so a rebooted NAT will break all connections. 

In contrast, our ETTM NAT is distributed and
fault-tolerant. We store the mappings using the consensus API
allowing any participating AEE to access the complete 
list of mappings. When the NAT filter running in a host’s 
Figure 5: Throughput performance of ETTM NAT as we vary
the Paxos group size. 


AEE detects a new outgoing flow, it temporarily holds the
flow and requests a mapping to an available, externally- 
visible port. This request is satisfied only if the port is 
actually available. Once this request completes, the NAT 
filter begins rewriting the packet headers for the flow and 
allows packets to flow normally. 

Handling incoming traffic is slightly more compli- 
cated. If the physical switches on the network sup- 
port flexible packet forwarding (as with OpenFlow hard- 
ware), they can be configured with soft state to forward 
traffic to the appropriate host where its NAT filter can 
rewrite the destination address.² If the soft state has not
yet been installed or has been lost due to failure, default 
forwarding rules result in the packet being delivered to 
some host which can appropriately forward the packet 
and install rules in the physical switches as needed. 

Our NAT also works if the physical switches do not 
support re-configurable routing. Instead, we assign the 
globally-visible IP address to a specific AEE and have 
that AEE forward traffic to appropriate hosts. While this 
might appear to be similar to proxying all external traf- 
fic through an end-host, such an approach would be nei- 
ther fault tolerant nor privacy preserving. In contrast, in 
ETTM the AEE allows for packets to be silently redi- 
rected to the appropriate host without those packets being 
visible to the user of the forwarding host. Also, the fail- 
ure of that AEE can be detected and another can be cho- 
sen with no lost state. When selecting an AEE, we use 
historical uptime data as well as information about
current load to avoid using unreliable hosts and to avoid
unnecessarily burdening loaded hosts. While it is possible
that a determined snoop might physically tap their Ethernet
wire to see forwarded packets, deployments that wish
to prevent this could enforce end-to-end encryption using 
a combination of SSL, IPsec and/or 802.1AE MACsec to 
encrypt all traffic entering or exiting the organization. 

Our NAT can be configured to allow passive connec- 


²We implement address translation in the AEE despite OpenFlow
support because some of our OpenFlow hardware has worse perfor- 
mance when modifying packets. Further, keeping translation tables 
reliably in AEEs keeps no hard state in the network. 
Figure 6: Availability of ETTM NAT as we vary the Paxos
group size. Note the y-axis is in log scale. 


tions to establish mappings. We have implemented a 
Linux kernel module that can be installed in the guest 
OS to explicitly notify the NAT filter whenever bind()
or listen() is called, triggering a request for a valid
mapping to an external IP address and port. This allows 
the ETTM system to direct incoming connections to the 
appropriate host without having the administrator set up 
customized port forwarding rules. We attempt to provide
each passive connection with the same external port as its
internal one; if this is not possible, the kernel module can
be queried for the external port number. 

Note that the ETTM approach for implementing NATs 
reinstates the fate sharing principle. We trivially support 
multiple ingress points to the network because there is 
no hard state stored in the network. A connection only 
fails if either endpoint fails or there is no path between 
them, but not if the middlebox fails. Even if the consen- 
sus group fails entirely, existing flows will still continue 
as long as one member of the group remains; of course, 
new flows may be delayed in this case. 

We evaluated the performance of our NAT module on 
a cluster of pc3000 nodes on Emulab. Figure 4 depicts 
the flow throughputs with and without the NAT module 
for TCP flows of various sizes over a 1 Gbps LAN link. 
The NAT filter imposes some added cost in terms of the 
latency of the first packet (about 1-2 ms), which affects 
the throughput of short flows in the LAN. For all other 
flows, the throughput of the NAT filter matches that of 
the direct communications channel, and it achieves the 
maximum possible throughput of 1 Gbps for large flows. 

Figure 5 plots the throughput of ETTM NAT by mea- 
suring the number of NAT translations that it can estab- 
lish per second as we vary the size of the Paxos group 
operating on behalf of the NAT. While the throughput 
falls with the number of nodes, it is still able to sustain an 
admission rate of 2000 new flows per second even with 
large Paxos groups. Additional scalability would be pos- 
sible if the external port space were partitioned among 
multiple Paxos groups. 

We also model the NAT failure probability using end- 
host availability data collected for hosts within the Mi- 
(a) Latency by request type with a single centralized cache.
(b) Latency by request type with a distributed cache across 6 nodes.


Figure 7: The cumulative distribution of latencies by type of
request with centralized (Figure 7(a)) and distributed
(Figure 7(b)) web caches.


crosoft corporate network [12, 5]. The trace data has 
81% of the end-hosts available at any time, and the me- 
dian session length of these end-hosts was in excess of 
16 hours. Figure 6 plots the probability of catastrophic 
failures assuming independent failures and a generous 
failure detection and group reconfiguration delay of 1 
minute. As we can see from this analysis, a handful of 
end-systems would suffice for most enterprise settings. 
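
The flavor of this analysis can be reproduced with a short calculation. The sketch below is not the model behind Figure 6. It assumes exponentially distributed session lengths with a median of exactly 16 hours, independent failures, and that a failure is catastrophic when at least half of the group departs within one detection window; the function names are hypothetical.

```python
from math import ceil, comb


def window_failure_prob(median_session_hours=16.0, window_minutes=1.0):
    """Chance that one node leaves within the detection window.

    Assumes exponential session lengths, so the survival probability
    over time t is 2 ** (-t / median).
    """
    return 1.0 - 2.0 ** (-(window_minutes / 60.0) / median_session_hours)


def catastrophic_failure_prob(n, p):
    """Chance that at least half of n nodes fail in the same window.

    Paxos needs a strict majority, so losing ceil(n / 2) or more
    nodes before reconfiguration stalls the group.
    """
    k_min = ceil(n / 2)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))
```

Under these assumptions the per-window failure probability of a single node is well below 0.1%, and the probability of losing half of the group falls off steeply as the group grows, which is consistent with the observation that a handful of end-systems suffices.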


3.2 Transparent Distributed Web Cache 


It is common for large networks to employ a transparent 
web cache such as Akamai [1] or squid [38] to improve 
performance and reduce bandwidth costs. These caches 
exploit similarity in different users’ browsing habits to 
reduce the total bandwidth consumption while also im- 
proving throughput and latency for requests served from 
the cache. 


Even though a shared cache is often very effective, 
many small and medium sized networks do not use one 
because of the administrative overhead of setting it up 
and the potential performance bottleneck if the central- 
ized cache is misconfigured. An alternative is to coordi- 
nate caches on each end-host [23], but this requires re- 
configuration by each user and it raises privacy concerns 
since requests can be snooped by anyone with adminis- 
trative privileges on any machine. 


We implemented a privacy-preserving
distributed cache. The cache runs as an ETTM network
management service that is triggered by a µvrouter filter
capturing all traffic headed to port 80. The service first 
checks the local AEE’s web cache to see if the request 
can be served from the local host. If it cannot be served 
locally, the service computes a consistent hash of the re- 
quest url and forwards it to a participating remote AEE 
based on the computed hash value. If the remote AEE 
does not have the content cached, it retrieves the content 
from the origin server, stores a copy in its local cache, 
and returns the fetched content to the requesting node. 
Note that the protocol traffic in ETTM is captured by the
web cache filter and is not visible to any of the guest 
OSes. Also, communication between the caches can be 
optionally encrypted to prevent snooping. We adapted 
squid [38] to serve as the cache in each AEE and to pro- 
vide the logic for interpreting http header directives, such 
as when to forward requests to the origin due to cache 
timeouts or outright disabling of caching. 
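
The request-routing step can be illustrated with a small consistent-hash ring. The sketch below is one common construction and is not taken from the ETTM implementation: the class `HashRing`, the use of SHA-1, the 64 virtual points per node, and the node names are assumptions.

```python
import bisect
import hashlib


def _h(key):
    """Map a string to a 64-bit point on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring mapping request URLs to cache nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns several points so load spreads more evenly.
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, url):
        # The first point clockwise from the URL's hash owns the URL.
        idx = bisect.bisect_right(self.keys, _h(url)) % len(self.ring)
        return self.ring[idx][1]
```

The useful property is that when an AEE leaves, only the URLs it owned are remapped; every other URL keeps its cache node, so the remaining caches stay warm.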

We evaluated our end-host based web-cache imple- 
mentation using a trace driven simulation. In order to 
generate trace data we aggregated the browser history of 
three of the authors and replayed the trace data on six 
nodes on Emulab [15]. In the centralized experiments, 
all clients but one have their cache disabled and were 
configured to send all requests to the one remaining ac- 
tive cache. In the distributed experiments each node runs 
its own cache. In the centralized case, the single cache is 
set to 600 MB, while in the distributed experiments the 
cache size for each of the six nodes is set to 100 MB. 

Cache hit rates are similar in both cases. For brevity 
we omit detailed analysis of hit rates and instead focus on 
latency. The cumulative distribution of latencies for the 
centralized and distributed caches is shown in Figure 7. 
The latency for objects found in the other nodes’ caches
is at most a few milliseconds more than local cache hits, 
indicating that the distributed nature of our implementa- 
tion imposes little or no performance penalty. 


3.3 Deep Packet Inspection


The ability to filter traffic based on the full packet 
contents and often the contents of multiple packets— 
commonly called deep packet inspection (DPI)—has 
quickly become a standard tool alongside traditional fire- 
walls and intrusion detection systems for detecting se- 
curity breaches. However, the computation required for 
deep packet inspection still limits its deployment.

The ETTM approach opens the door to ‘outsourcing’ 
the DPI computation to end-hosts where there is almost 
certainly more aggregate compute power than inside a 
dedicated DPI middlebox. Traditionally, the idea of run- 
ning this DPI code at end-hosts would flounder because 
they could not be trusted to execute the code faithfully— 
a virus infecting one host could undermine network secu- 
Figure 8: CPU load of ETTM DPI module as we vary the
transfer rate of our trace. 


rity. While no security is invulnerable, we offer a narrow 
attack surface similar to middleboxes, and also use attes- 
tation to be able to make claims about booted software 
and detect malicious changes on reboots. 


Our implementation of DPI is based on the Snort [37] 
engine and renders decisions either by delaying or drop- 
ping traffic or by tagging flows with metadata. The DPI 
filter is run within the end-host AEE and inspects the 
flows being sourced from or received by the end-host. In 
addition, the DPI modules running on end-hosts period- 
ically exchange CPU load information with each other. 
In situations where the end-host CPU is overloaded, as 
in highly-loaded web servers, the flows are redirected to 
some other lightly loaded end-host running the ETTM 
stack in order to perform the DPI tasks. 


The two commonly used applications of DPI are to 
detect possible attacks and to discover obfuscated peer- 
to-peer traffic. In the case of detecting attacks, the filter 
releases traffic after it has been scanned for attack sig- 
natures and found to be clean. If a flow is flagged as an 
attack, no further traffic is allowed, and the source is
labeled as being believed to be compromised. In the case
of obfuscated peer-to-peer traffic, normal traffic is passed
through the DPI filter without delay, but when a flow is 
categorized as peer-to-peer the flow is labeled with meta- 
data. The next section describes how we can use these 
labels to adjust priorities for peer-to-peer traffic. 

Figure 8 shows benchmark results from a trace-based 
evaluation of our DPI filter. We ran the ETTM stack on a 
quad-core Intel Xeon machine with 4 GB of RAM where 
each core runs at 2 GHz. However, we only make use of 
one core as snort-2.8 is single-threaded. The traces
are from the DEFCON 17 “capture the flag” dataset [13],
which contain numerous intrusion attempts and serve as
commonly used benchmarks for evaluating DPI
performance. We varied the trace playback rate from 1x to 1024x
and measured the CPU load imposed by our DPI filter
at various traffic rates. Figure 8 shows the load on the 
ETTM CPU to analyze traffic to/from that CPU. This 
demonstrates that running DPI on a single core per host 
is feasible. Stated in other terms, the ETTM approach 
of performing DPI computation on end-hosts scales with
the number of ETTM machines; centralizing DPI com- 
putation on specialized hardware is more expensive and 
less scalable. 


3.4 Bandwidth Allocation 


The ability for ETTM to control network behavior on a 
packet granularity provides an opportunity for more ef- 
ficient bandwidth management. In TCP, hosts increase 
their send rates until router buffers overflow and start 
dropping packets. As a result, it is well-known that the 
latency of short flows degrades whenever a congested 
link is shared with a bandwidth-intensive flow. Many 
large enterprises deploy hardware-based packet shapers 
at the edge of the network to throttle high bandwidth 
flows before they overwhelm the bottleneck link. In 
this subsection, we demonstrate a backwardly compat- 
ible software-based ETTM solution to this issue; we use 
this as an illustration of how ETTM can be used to im- 
prove quality-of-service in an enterprise setting. 

We call our bandwidth allocation strategy TCP with 
reservations or TCP-R; the approach is similar to the ex- 
plicit bandwidth signaling in ATM. In TCP-R, bandwidth 
allocations for the bottleneck access link are performed 
by a controller replicated using the consensus API.
Endpoints managing TCP flows make bandwidth allocation
requests to the controller, which responds with
reservations for short periods of time. We next describe the logic
executed at end-hosts followed by the controller logic.


Endpoint: Whenever a new flow crossing the access link 
appears and every RTT after that, the bandwidth alloca- 
tion filter on the local host issues a bandwidth reservation 
request to the controller. The request is for the maximum 
bandwidth the host needs, that can be allocated safely 
without causing queueing at the congested link. The con- 
troller responds with an allocation and a reservation for 
the subsequent round-trips. 

Once the reservation has been agreed upon, the filter 
limits the flow to using that amount of bandwidth until 
it issues a subsequent reservation. The amount of the 
new reservation is based on the last RTT of behavior. Let 
$A_f(i-1)$ be the bandwidth allocated to flow $f$ in period
$i-1$, and let $U_f(i-1)$ be the bandwidth utilized by it
during the period. Then it makes a reservation request
$R_f(i)$ based on the following logic; this preserves TCP
behavior for the portion of the path external to the LAN, 
while allowing for explicit allocation of the access link. 


• If the flow used up its allocation, it asks the controller
to provide it the maximum allowed by the TCP congestion
window ($R_f(i) = \mathit{cwnd}/\mathit{RTT}$).

• If the flow did not use up its bandwidth allocation in
the previous RTT, then it issues a new request for the
lesser of the bandwidth it did use and the TCP congestion
window, relinquishing its unused reservation
($R_f(i) = \min(\mathit{cwnd}/\mathit{RTT},\, U_f(i-1))$).
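
The two cases above reduce to a few lines of code. The following sketch assumes rates in bytes per second and treats a flow as having used up its allocation when its utilization reaches the allocated amount; the function and parameter names are hypothetical.

```python
def reservation_request(cwnd_bytes, rtt_s, allocated_prev, used_prev):
    """Compute the request R_f(i) from the previous period's behavior.

    cwnd_bytes / rtt_s is the rate the TCP congestion window allows.
    allocated_prev and used_prev are A_f(i-1) and U_f(i-1).
    """
    window_rate = cwnd_bytes / rtt_s
    if used_prev >= allocated_prev:
        # Allocation exhausted: ask for all that TCP would permit.
        return window_rate
    # Otherwise give back what went unused.
    return min(window_rate, used_prev)
```

For example, a flow with a 20,000-byte window and a 0.5 s RTT that filled its previous allocation requests 40,000 bytes per second, whereas one that used only 10,000 bytes per second of a larger allocation requests 10,000.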


Controller: The controller allocates bandwidth among 
the reservation requests according to max-min fairness. 
It publishes the results by committing its allocation deci- 
sion across the various controller instances using Paxos. 
Note that the actual reservation amount can be less than 
what was requested. 


Periodically the controller processes the bandwidth 
requests and makes an allocation using the following 
scheme to achieve max-min fairness. It sorts the flows 
based on their requested bandwidth. Let
$R_0 < R_1 < R_2 < \dots < R_{n-2} < R_{n-1}$
be the set of sorted bandwidth requests, $L$ be the link
access bandwidth, and $A = 0$ be the allocated bandwidth at
the beginning of each allocation round. The controller
considers these requests in increasing order and allocates
each flow the requested bandwidth or its fair share,
whichever is lower. Concretely, for each flow $j$, it
computes $A_j = \min\left(R_j, \frac{L - A}{n - j}\right)$ and sets
$A = A + A_j$. Note that $\frac{L - A}{n - j}$ is the fair share of
flow $j$ after having allocated $A$ bandwidth resources to
the $j$ flows considered before it.
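
The allocation round can be written as a short routine. This is a sketch of the standard max-min procedure as described above, with no Paxos commit step; the function name and the list-based interface are assumptions.

```python
def max_min_allocate(requests, link_capacity):
    """Max-min fair allocation of link_capacity among the requests.

    Requests are visited in increasing order; each flow receives the
    smaller of its request and an equal split of what remains among
    the flows not yet served.
    """
    n = len(requests)
    order = sorted(range(n), key=lambda i: requests[i])
    allocated = 0.0                      # A in the text
    result = [0.0] * n
    for j, idx in enumerate(order):
        fair_share = (link_capacity - allocated) / (n - j)
        result[idx] = min(requests[idx], fair_share)
        allocated += result[idx]
    return result
```

With requests of 2, 8, 4 and 10 units on a 20-unit link, the two small flows receive what they asked for and the two large flows split the remaining 14 units evenly. To leave headroom as described in the next paragraph, a caller could pass 90% of the physical link rate as `link_capacity`.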


In practice, because it takes some time to acquire a 
reservation, we leave some fraction of the link (10% in 
our implementation) unallocated and allow each flow to 
send a few packets (4 in our implementation) before re- 
ceiving a reservation. Because the time to acquire a 
reservation (a millisecond or less) is smaller than most 
Internet round trip times, this avoids adversely affecting 
flows with increased latency. 


TCP-R has many benefits over traditional TCP. It does 
not drive the bottleneck link to saturation, thereby avoid- 
ing losses and sub-optimal use of network resources. In 
particular, latency-sensitive web traffic can obtain its
share of the bandwidth resource even if there are simul- 
taneous large background transfers. 


This implementation of bandwidth allocation assumes 
that we are only managing the upload bandwidth of our 
access link. In the future, we plan to extend our
implementation to handle arbitrary bottlenecks as well as the
allocation of incoming bandwidth. 


Evaluation: Our evaluation illustrates the ability of the 
ETTM bandwidth allocator to provide a fair allocation to 
interactive web traffic. On Emulab, we set up an access 
link with a bottleneck bandwidth of 10 Mb/s and com- 
pared the latency of accessing google.com with and 
without background BitTorrent traffic that is generated 
by a different end-host in the network. Figure 9 depicts 
the webpage access latency at different points in time. 
When there is no competing traffic, the average access
latency is 0.68 seconds. When there is competing traffic
(during attempts 11 through 30), the average access
latency is 5.67 seconds if we don’t use the ETTM bandwidth
allocator. With the ETTM bandwidth allocator, the
interactive web traffic receives a fair share and incurs a
latency of 1.04 seconds.

Figure 9: Webpage access latency in the presence of competing
BitTorrent traffic with and without the bandwidth allocator.
The solid lines depict the access latency when there is
competing BitTorrent traffic.


4 Related Work 


Providing network administrators more control at lower 
cost is a longstanding goal of network research. Sev- 
eral recent projects have focused on providing adminis- 
trators a logically centralized interface for configuring a 
distributed set of network routers and switches. Exam- 
ples of this approach include 4D [34, 17, 42], NOX [19], 
Ethane [8, 7], Maestro [6] and CONMan [2]. Of course, 
the power of these systems is limited to the configurabil- 
ity of the hardware they control. While we agree with the 
need for logical centralization of network management 
functions, our hypothesis is that network administrators 
would prefer fine-grained, packet-level control over their
networks, something that is not possible at line rate with
today’s low-cost network switches.

Other efforts have focused on building drop-in
replacements for the virtual Ethernet switch inside
existing hypervisors. Cisco’s Nexus 1000V virtual
switch [9, 40] provides a standard Cisco switch interface,
extending switching policies to the edge of VMs as well
as hosts. Open vSwitch [33] accomplishes a similar feat,
but provides an OpenFlow interface to the virtual switch
and is compatible with Xen and a few other hypervisors.
Still others are working to do hardware network I/O
virtualization [32]. While all of these tools give network
administrators additional points of control, they do not
offer the flexibility required to implement the breadth of
coordinated network policies administrators seek today.
Instead, we are working to incorporate these standardized,
simple points of control into ETTM to provide
potentially higher performance for some tasks and added
control over the low-level network.

Other systems have tried to bring end-hosts into net- 
work management, though in limited ways. Microsoft’s 
Active Directory includes Group Policy which allows for 
control over the actions which connected Windows hosts 
are allowed to carry out, but enforces them only assuming
the host remains uncompromised. Network Exception
Handlers [24] allow end-hosts to react to certain
network events, but still leave network hardware
dominantly in control. Still other work [11] uses end-hosts to
provide visibility into network traffic, but does not pro- 
vide a point of control and assumes that the host remains 
uncompromised. 

Other recent work has attempted to increase the flex- 
ibility of network switches to carry out administrative 
tasks. OpenFlow [30] adds the ability to configure rout- 
ing and filtering decisions in LAN switches based on pat- 
tern matching on packet headers performed in hardware. 
A limitation of OpenFlow is throughput when packets 
need to be processed out of band, because there is typi- 
cally only one underpowered control processor per LAN 
switch. In ETTM, we invoke out of band processing on 
the switch only for the initial TPM verification when the 
node connects, while still allowing the network adminis- 
trator to add arbitrary processing on every packet. 

Middleboxes have always been a contentious topic, 
but recent work has looked at how to embrace mid- 
dleboxes and treat them as first-class citizens. In 
TRIAD [18] middleboxes are first-order constructs in 
providing a content-addressable network architecture. 
The Delegation-Oriented Architecture [41] allows hosts 
to explicitly invoke middleboxes, while NUTSS [20] 
proposes a novel connection establishment mechanism 
which includes negotiation of which middleboxes should 
be involved. Our work can be seen as enabling network 
administrators to place arbitrary packet-granularity mid- 
dlebox functionality throughout the network, via vali- 
dated software running on end-hosts. 

Existing work has leveraged trusted computing hard- 
ware to avoid vulnerabilities in commodity software [35] 
as well as to ensure correct execution of specific 
tasks [29]. Our use of trusted computing hardware is 
complementary to these efforts. 


5 Conclusion 


Enterprise-level network management today is complex, 
expensive and unsatisfying: seemingly straightforward 
quality of service and security goals can be difficult to 
achieve even with an unlimited budget. In this paper, we 
have designed, implemented and evaluated a novel ap- 
proach to provide network administrators more control 
at lower cost, and their users higher performance, more 
reliability, and more flexibility. Network management
tasks are implemented as software applications running 
in a distributed but secure fashion on every end-host, in- 
stead of on closed proprietary hardware at fixed points 
in the network. Our approach leverages the increasing 
availability of trusted computing hardware on end-hosts 
and reconfigurable routing tables in network switches, 
as well as the expansive computing capacity of modern 
multicore architectures. We show that our approach can 
support complex tasks such as fault tolerant network ad- 
dress translation, network-wide deep packet inspection 
for virus control, privacy preserving peer-to-peer web 
caching, and congested link bandwidth prioritization, all 
with reasonable performance despite the added overhead 
of fault tolerant distributed coordination. 
ABSTRACT


Multipath TCP, as proposed by the IETF working group 
mptcp, allows a single data stream to be split across 
multiple paths. This has obvious benefits for reliability, 
and it can also lead to more efficient use of networked 
resources. We describe the design of a multipath con- 
gestion control algorithm, we implement it in Linux, 
and we evaluate it for multihomed servers, data centers 
and mobile clients. We show that some ‘obvious’ solu- 
tions for multipath congestion control can be harmful, 
but that our algorithm improves throughput and fairness 
compared to single-path TCP. Our algorithm is a drop-in 
replacement for TCP, and we believe it is safe to deploy. 


1. INTRODUCTION 


Multipath TCP, as proposed by the IETF working group
mptcp [7], allows a single data stream to be split across
multiple paths. This has obvious benefits for reliability — 
the connection can persist when a path fails. It can also 
have benefits for load balancing at multihomed servers 
and data centers, and for mobility, as we show below. 

Multipath TCP also raises questions, some obvious 
and some subtle, about how network capacity should be 
shared efficiently and fairly between competing flows. 
This paper describes the design and implementation of 
a multipath congestion control algorithm that works ro- 
bustly across a wide range of scenarios and that can be 
used as a drop-in replacement for TCP. 

In §2 we propose a mechanism for windowed
congestion control for multipath TCP, and then spell out
the questions that led us to it. This section is presented
as a walk through the design space signposted by
pertinent examples and analysed by calculations and thought
experiments. It is not an exhaustive survey of the
design space, and we do not claim that our algorithm is
optimal; to even define optimality would require a more
advanced theoretical underpinning than we have yet
developed. Some of the issues (§2.1-§2.3) have previously
been raised in the literature on multipath congestion
control, but not all have been solved. The others
(§2.4-§2.5) are novel.

In §3-§5 we evaluate our algorithm in three application
scenarios: multihomed Internet servers, data centers,
and mobile devices. We do this by means of simulations
with a high-speed custom packet-level simulator,
and with testbed experiments on a Linux implementation.
We show that multipath TCP is beneficial, as long
as congestion control is done right. Naive solutions can
be worse than single-path TCP.

In §6 we discuss what we learnt from implementing
the protocol in Linux. There are hard questions about
how to avoid deadlock at the receiver buffer when packets
can arrive out of order, and about the datastream
sequence space versus the subflow sequence spaces. But
careful consideration of corner cases forced us to our
specific implementation. In §7 we discuss related work
on protocol design.

In this paper we will restrict our attention to end- 
to-end mechanisms for sharing capacity, specifically to 
modifications to TCP’s congestion control algorithm. We 
will assume that each TCP flow has access to one or 
more paths, and it can control how much traffic to send 
on each path, but it cannot specify the paths themselves. 
For example, our Linux implementation uses multihom- 
ing at one or both ends to provide path choice, but it 
relies on the standard Internet routing mechanisms to 
determine what those paths are. Our reasons for these 
restrictions are (i) the IETF working group is working
under the same restrictions, (ii) they lead to a readily
deployable protocol, i.e. no modifications to the core of
the Internet, and (iii) theoretical results indicate that
inefficient outcomes may arise when both the end-systems
and the core participate in balancing traffic [1].


2. THE DESIGN PROBLEM FOR MULTIPATH RESOURCE ALLOCATION


The basic window-based congestion control algorithm 
employed in TCP consists of additive increase behaviour 
when no loss is detected, and multiplicative decrease 
when a loss event is observed. In short: 


ALGORITHM: REGULAR TCP 


• Each ACK, increase the congestion window w by
1/w, resulting in an increase of one packet per RTT.¹

• Each loss, decrease w by w/2.

Additionally, at the start of a connection, an exponential
increase is used, as it is immediately after a
retransmission timeout. Newer versions of TCP [24, 9] have
faster behaviour when the network is underloaded; we
believe our multipath enhancements can be
straightforwardly applied to these versions, but it is a topic for
further work.

¹For simplicity, we express windows in this paper in packets,
but real implementations usually maintain them in bytes.

Figure 1: A scenario which shows the importance
of weighting the aggressiveness of subflows.
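The two window rules of regular TCP can be sketched in a few lines. The class and method names are hypothetical, the window is kept in packets as in the text, and slow start and timeouts are left out.

```python
class RegularTcpWindow:
    """Additive increase, multiplicative decrease on a window in packets."""

    def __init__(self, w=1.0):
        self.w = w

    def on_ack(self):
        # About w ACKs arrive per RTT, so the window grows by roughly
        # one packet per RTT.
        self.w += 1.0 / self.w

    def on_loss(self):
        # Halve the window on a loss event.
        self.w -= self.w / 2.0
```

Starting from a window of 10 packets, ten ACKs (about one RTT's worth) bring the window to just under 11 packets, and a loss then halves it.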

The congestion control algorithm we propose is this: 


ALGORITHM: MPTCP
A connection consists of a set of subflows $R$, each of which
may take a different route through the Internet. Each
subflow $r \in R$ maintains its own congestion window
$w_r$. An MPTCP sender stripes packets across these
subflows as space in the subflow windows becomes
available. The windows are adapted as follows:

• Each ACK on subflow $r$, for each subset $S \subseteq R$ that
includes path $r$, compute

$$\frac{\max_{s \in S} w_s/\mathrm{RTT}_s^2}{\left(\sum_{s \in S} w_s/\mathrm{RTT}_s\right)^2} \qquad (1)$$

then find the minimum over all such $S$, and increase
$w_r$ by that much. (The complexity of finding the
minimum is linear in the number of paths, as we
show in the appendix.)

• Each loss on subflow $r$, decrease the window $w_r$ by
$w_r/2$.

Here $\mathrm{RTT}_r$ is the round trip time as measured by
subflow $r$. We use a smoothed RTT estimator, computed
similarly to TCP.

In our implementation, we compute the increase
parameter only when the congestion windows grow to
accommodate one more packet, rather than every ACK on
every subflow.
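The increase rule can be sketched by enumerating the subsets directly. This brute-force version is an illustration only: it takes time exponential in the number of paths, whereas the text states that a linear-time method exists, and the function name and list-based arguments are assumptions.

```python
from itertools import combinations


def mptcp_increase(r, windows, rtts):
    """Per-ACK increase for subflow r: the minimum of expression (1)
    over every subset S of paths that contains r."""
    others = [s for s in range(len(windows)) if s != r]
    best = float("inf")
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            subset = (r,) + extra
            numerator = max(windows[s] / rtts[s] ** 2 for s in subset)
            denominator = sum(windows[s] / rtts[s] for s in subset) ** 2
            best = min(best, numerator / denominator)
    return best
```

With a single path the only subset is {r} and the increase is 1/w, as in regular TCP. With two paths of equal window (10 packets) and equal RTT, the joint subset gives the minimum, 0.025, a quarter of what regular TCP would add on each path.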


The following subsections explain how we arrived at 
this design. The basic question we set out to answer is
how precisely to adapt the subflow windows of a mul- 
tipath TCP so as to get the maximum performance pos- 
sible, subject to the constraint of co-existing gracefully 
with existing TCP traffic. 


2.1 Fairness at shared bottlenecks 


The obvious question to ask is why not just run regular
TCP congestion control on each subflow? Consider
the scenario in Fig. 1. If multipath TCP ran regular
TCP congestion control on both paths, then the multipath
flow would obtain twice as much throughput as the
single path flow (assuming all RTTs are equal). This is
unfair. An obvious solution is to run a weighted version
of TCP on each subflow, weighted so as to take some
fixed fraction of the bandwidth that regular TCP would
take. The weighted TCP proposed by [5] is not suitable
for weights smaller than 0.5, so instead [11] consider the
following algorithm, EWTCP.


ALGORITHM: EWTCP

• For each ACK on path $r$, increase window $w_r$ by
$a/w_r$.

• For each loss on path $r$, decrease window $w_r$ by
$w_r/2$.

Here $w_r$ is the window size on path $r$, and $a = 1/\sqrt{n}$
where $n$ is the number of paths.


Each subflow gets window size proportional to $a^2$ [11].
By choosing $a = 1/\sqrt{n}$, and assuming equal RTTs,
the multipath flow gets the same throughput as a regular
TCP at the bottleneck link. This is an appealingly
simple mechanism in that it does not require any sort of
explicit shared-bottleneck detection.

Figure 2: A scenario to illustrate the importance
of choosing the less-congested path.
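The fairness claim can be checked with a line of arithmetic. The sketch below takes the statement that each subflow's window is proportional to a² at face value and assumes the constant of proportionality is the window a regular TCP would have at the same bottleneck; the function name is hypothetical.

```python
import math


def ewtcp_total_window(n_paths, tcp_window):
    """Total window of an n-path EWTCP flow at a shared bottleneck,
    assuming each subflow window equals a^2 times the regular TCP window."""
    a = 1.0 / math.sqrt(n_paths)
    per_subflow = a ** 2 * tcp_window
    return n_paths * per_subflow
```

Under these assumptions, four subflows each carry a quarter of the regular TCP window, so the total equals the window of one regular TCP flow whatever the number of paths.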


2.2 Choosing efficient paths 


Although EWTCP can be fair to regular TCP traffic, it
would not make very efficient use of the network.
Consider the somewhat contrived scenario in Fig.2, and
suppose that the three links each have capacity 12Mb/s. If
each flow split its traffic evenly across its two paths²,
then each subflow would get 4Mb/s hence each flow
would get 8Mb/s. But if each flow used only the one-hop
shortest path, it could get 12Mb/s. (In general, however,
it is not efficient to always use only shortest paths, as the
simulations in §4 of data center topologies show.)

A solution has been devised in the theoretical literature
on congestion control, independently by [15] and
[10]. The core idea is that a multipath flow should shift
all its traffic onto the least-congested path. In a situation
like Fig. 2 the two-hop paths will have higher drop
probability than the one-hop paths, so applying the core
idea will yield the efficient allocation. Surprisingly it
turns out that this can be achieved (in theory) without
any need to explicitly measure congestion³. Consider
the following algorithm, called COUPLED⁴:


ALGORITHM: COUPLED

• For each ACK on path $r$, increase window $w_r$ by
$1/w_{total}$.

• For each loss on path $r$, decrease window $w_r$ by
$w_{total}/2$.

Here $w_{total}$ is the total window size across all subflows.
We bound $w_r$ to keep it non-negative; in our experiments
we bound it to be $\ge 1$pkt, but for the purpose
of analysis it is easier to think of it as $\ge 0$.

²In this topology EWTCP wouldn’t actually split its traffic
evenly, since the two-hop path traverses two bottleneck
links and so experiences higher congestion. In fact, as TCP’s
throughput is inversely proportional to the square root of loss
rate, EWTCP would end up sending approximately 3.5Mb/s
on the two-hop path and 5Mb/s on the single-hop path, a total
of 8.5Mb/s, slightly more than with an even split, but much
less than with an optimal allocation.


To get a feeling for the behaviour of this algorithm,
we now derive an approximate throughput formula.
Consider first the case that all paths have the same loss rate
$p$. Each window $w_r$ is made to increase on ACKs, and
made to decrease on drops, and in equilibrium the
increases and decreases must balance out, i.e. rate of ACKs
× average increase per ACK must equal rate of drops ×
average decrease per drop, i.e.

$$\left(\frac{w_r}{\mathrm{RTT}_r}(1-p)\right)\frac{1}{w_{total}} = \left(\frac{w_r}{\mathrm{RTT}_r}\,p\right)\frac{w_{total}}{2}. \qquad (2)$$

Solving for $w_{total}$ gives $w_{total} = \sqrt{2(1-p)/p} \approx \sqrt{2/p}$
(where the approximation is good if $p$ is small). Note
that when there is just one path then COUPLED reduces
to regular TCP, and that the formula for $w_{total}$ does not
depend on the number of paths, hence COUPLED
automatically solves the fairness problem in §2.1.
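The equilibrium can be checked numerically. The sketch below assumes a common loss rate p on all paths and cancels the shared factor w_r/RTT_r from both sides of the balance equation; the function names are illustrative.

```python
import math


def coupled_total_window(p):
    """Equilibrium total window of COUPLED for a common loss rate p."""
    return math.sqrt(2.0 * (1.0 - p) / p)


def balance_gap(p):
    """Increase rate minus decrease rate at equilibrium, per unit of
    w_r / RTT_r. Should be zero up to rounding."""
    w_total = coupled_total_window(p)
    increase_rate = (1.0 - p) * (1.0 / w_total)
    decrease_rate = p * (w_total / 2.0)
    return increase_rate - decrease_rate
```

For p = 1% the exact total window is about 14.07 packets, within one percent of the approximation sqrt(2/p), which is about 14.14.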

For the case that the loss rates are not all equal, let
$p_r$ be the loss rate on path $r$ and let $p_{min}$ be the
minimum loss rate seen over all paths. The increase and
decrease amounts are the same for all paths, but paths with
higher $p_r$ will see more decreases, hence the equilibrium
window size on a path with $p_r > p_{min}$ is $w_r = 0$.
In Fig.2, the two-hop paths go through two congested
links, hence they will have higher loss rates than the
one-hop paths, hence COUPLED makes the efficient choice
of using only the one-hop paths.

An interesting consequence of moving traffic away
from more congested paths is that loss rates across the
whole network will tend to be balanced. See §3 for
experiments which demonstrate this. Or consider the
network shown in Fig.3, and assume all RTTs are equal.
Under EWTCP each link will be shared evenly between
the subflows that use it, hence flow A gets throughputs
5 and 6 Mb/s, B gets 6 and 5 Mb/s, and C gets 5 and
3 Mb/s. Since TCP throughput is inversely related to
drop probability, we deduce that the 3Mb/s link has the
highest drop probability and the 12Mb/s link the lowest.
For COUPLED, we can calculate the throughput on
each subflow by using two facts: that a flow uses a path
only if that path has the lowest loss rate $p_{min}$ among its
available paths, and that a flow’s total throughput is
proportional to $\sqrt{2/p_{min}}$; the only outcome consistent with
these facts is for all four links to have the same loss rate,
and for all flows to get the same throughput, namely
10Mb/s.

³Of course it can also be achieved by explicitly measuring
congestion as in [11], but this raises tricky measurement
questions.

⁴COUPLED is adapted from [15, equation (21)] and [10,
equation (14)], which propose a differential equation model for a
rate-based multipath version of ScalableTCP [16]. We applied
the concepts behind the equations to classic window-based
TCP rather than to a rate-based version of ScalableTCP, and
translated the differential equations into a congestion control
algorithm.

Figure 3: A scenario where EWTCP (left)
does not equalize congestion or total throughput,
whereas COUPLED (right) does.

Figure 4: A scenario in which RTT and congestion
mismatch can lead to low throughput. WiFi path:
$\mathrm{RTT}_1$ = 10ms, $p_1$ = 4% loss. 3G path:
$\mathrm{RTT}_2$ = 100ms, $p_2$ = 1% loss.

In this scenario the rule “only use a path if that path 
has lowest drop probability among available paths” leads 
to balanced congestion and balanced total throughput. 
In some scenarios, these may be desirable goals per se. 
Even when they are not the primary goals, they are still 
useful as a test: a multipath congestion control algo- 
rithm that does not balance congestion in Fig.3 is un- 
likely to make the efficient path choice in Fig.2. 


2.3 Problems with RTT mismatch 


Both EWTCP and COUPLED have problems when
the RTTs are unequal. This is demonstrated by
experiments in §5. To understand the issue, consider the
scenario of a wireless client with two interfaces shown in
Fig.4: the 3G path typically uses large buffers, resulting
in long delays and low drop rates, whereas the WiFi
path might have smaller delays and higher drop rate. As
a simple approximation, take the drop rates to be fixed
(though in practice, e.g. in the experiments in §5, the
drop rate will also depend on the sender’s data rate).
Also, take the throughput of a single-path TCP to be
$\sqrt{2/p}/\mathrm{RTT}$ pkt/s. Then


• A single-path WiFi flow would get 707 pkt/s, and a
single-path 3G flow would get 141 pkt/s.


• EWTCP is half as aggressive as single-path TCP
on each path, so it will get total throughput
(707 + 141)/2 = 424 pkt/s.


• COUPLED will send all its traffic on the less
congested path, on which it will get the same window
size as single-path TCP, so it will get total
throughput 141 pkt/s.⁵


Figure 5: A scenario where multipath TCP might
get ‘trapped’ into using a less desirable path.


Both EWTCP and COUPLED are undesirable to a user
considering whether to adopt multipath TCP.
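The three throughput numbers follow directly from the approximation sqrt(2/p)/RTT and the Fig.4 parameters (WiFi: 10ms RTT, 4% loss; 3G: 100ms RTT, 1% loss). The variable and function names below are illustrative.

```python
import math


def tcp_rate(p, rtt_s):
    """Approximate single-path TCP throughput in packets per second."""
    return math.sqrt(2.0 / p) / rtt_s


wifi = tcp_rate(0.04, 0.010)      # about 707 pkt/s
three_g = tcp_rate(0.01, 0.100)   # about 141 pkt/s

# EWTCP with two paths: half as aggressive as TCP on each path.
ewtcp = (wifi + three_g) / 2      # about 424 pkt/s

# COUPLED: everything goes on the lower-loss path, which is 3G here.
coupled = three_g                 # about 141 pkt/s
```

Both multipath variants end up below the 707 pkt/s that a plain single-path TCP would get on WiFi alone, which is the point of the example.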

One solution is to switch from window-based control 
to rate-based control; the rate-based equations [15, 10] 
that inspired COUPLED do not suffer from RTT mis- 
match. But this would be a drastic change to the In- 
ternet’s congestion control architecture, a change whose 
time has not yet come. Instead, we have a practical
suggestion for window-based control, which we describe
in §2.5. First though we describe another problem with
COUPLED and our remedy.


2.4 Adapting to load changes 


It turns out there is another pitfall with COUPLED,
which shows itself even when all subflows have the same
RTT. Consider the scenario in Fig. 5. Initially there are
two single-path TCPs on each link, and one multipath
TCP able to use both links. It should end up balancing
itself evenly across the two links, since if it were uneven
then one link would be more congested than the other
and COUPLED would shift some of its traffic onto the
less congested. Suppose now that one of the flows on
the top link terminates, so the top link is less congested,
hence the multipath TCP flow moves all its traffic onto
the top link. But then it is ‘trapped’: no matter how
much extra congestion there is on the top link, the
multipath TCP flow is not using the bottom link, so it
gets no ACKs on the bottom link, so COUPLED is
unable to increase the window size on the bottom subflow.
The same problem is demonstrated in experiments in §3.

⁵The ‘proportion manager’ in the multipath algorithm of [11]
will also move all the traffic onto the less congested path, with
the same outcome.

We can conclude that the simple rule “Only use the 
least congested paths” needs to be balanced by an op- 
posing consideration, “Always keep sufficient traffic on 
other paths, as a probe, so that you can quickly discover 
when they improve.” In fact, our implementation of 
COUPLED keeps window sizes > 1pkt, so it always has 
some probe traffic. And the theoretical works [15, equa- 
tion (11)] and [10, equation (14)] that inspired COU- 
PLED also have a parameter that controls the amount of 
probing; the theory says that with infinitesimal probing 
one can asymptotically (after a long enough time, and 
with enough flows) achieve fair and efficient allocations. 

But we found in experiments that if there is too lit- 
tle probe traffic then feedback about congestion is too 
infrequent for the flow to discover changes in a reason- 
able time. Noisy feedback (random packet drops) makes 
it even harder to get a quick reliable signal. As a com- 
promise, we propose the following. 


ALGORITHM: SEMICOUPLED

• For each ACK on path $r$, increase window $w_r$ by
$a/w_{total}$.

• For each loss on path $r$, decrease window $w_r$ by
$w_r/2$.

Here $a$ is a constant which controls the aggressiveness,
discussed below.


SEMICOUPLED tries to keep a moderate amount of
traffic on each path while still having a bias in favour of
the less congested paths. For example, suppose a
SEMICOUPLED flow is using three paths, two with drop
probability 1% and a third with drop probability 5%. We can
calculate equilibrium window sizes by a balance argument
similar to (2); when $1 - p_r \approx 1$ the window sizes
are

$$w_r \approx \sqrt{2a}\,\frac{1/p_r}{\sqrt{\sum_s 1/p_s}}.$$

In the three-path example, the flow will put 45% of its weight
on each of the less congested paths and 10% on the more
congested path. This is intermediate between EWTCP
(33% on each path) and COUPLED (0% on the more
congested path).
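The split in the three-path example can be reproduced from the equilibrium formula, in which each window is proportional to 1/p_r. The sketch below is illustrative; the function names are assumptions, and a = 1 is an arbitrary choice because the shares do not depend on a.

```python
import math


def semicoupled_windows(loss_rates, a=1.0):
    """Approximate equilibrium windows, valid when 1 - p_r is close to 1."""
    inverse = [1.0 / p for p in loss_rates]
    scale = math.sqrt(2.0 * a) / math.sqrt(sum(inverse))
    return [scale * x for x in inverse]


def shares(windows):
    """Fraction of the total window carried by each path."""
    total = sum(windows)
    return [w / total for w in windows]
```

For loss rates of 1%, 1% and 5%, the shares come out at 100/220, 100/220 and 20/220, that is about 45%, 45% and 9%, in line with the rounded figures quoted in the text.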

To achieve fairness in scenarios like Fig.1, one can 
fairly simply tune the a parameter. For more compli- 
cated scenarios like Fig.4, we need a more rigorous def- 
inition of fairness, which we now propose. 


2.5 Compensating for RTT mismatch 


In order to reason about bias and fairness in a
principled way, we propose the following two requirements
for multipath congestion control:


• A multipath flow should give a connection at least
as much throughput as it would get with single-path
TCP on the best of its paths. This ensures there is an
incentive for deploying multipath.


• A multipath flow should take no more capacity on
any path or collection of paths than if it was a
single-path TCP flow using the best of those paths. This
guarantees it will not unduly harm other flows at a
bottleneck link, no matter what combination of paths
passes through that link.


Figure 6: Fairness constraints for a two-path flow.
Constraint (3) on the left, constraints (4) on the
right.


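In mathematical notation, suppose the set of available
paths is $R$, let $\hat{w}_r$ be the equilibrium window obtained
by multipath TCP on path $r$, and let $\hat{w}_r^{TCP}$ be the
equilibrium window that would be obtained by a single-path
TCP experiencing path $r$’s loss rate. We shall require

$$\sum_{r \in R} \frac{\hat{w}_r}{\mathrm{RTT}_r} \ge \max_{r \in R} \frac{\hat{w}_r^{TCP}}{\mathrm{RTT}_r} \qquad (3)$$

$$\sum_{r \in S} \frac{\hat{w}_r}{\mathrm{RTT}_r} \le \max_{r \in S} \frac{\hat{w}_r^{TCP}}{\mathrm{RTT}_r} \quad \text{for all } S \subseteq R. \qquad (4)$$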


These constraints are illustrated, for a two-path flow, in
Fig.6. The left hand figure illustrates (3), namely that
$(\hat{w}_1, \hat{w}_2)$ should lie on or above the diagonal line. The
exact slope of the diagonal is dictated by the ratio of
RTTs, and here we have chosen them so that
$\hat{w}_2^{TCP}/\mathrm{RTT}_2 > \hat{w}_1^{TCP}/\mathrm{RTT}_1$.
The right hand figure illustrates the three
constraints in (4). The constraint for $S = \{\mathrm{path}_1\}$ says
to pick a point on or left of the vertical line. The
constraint for $S = \{\mathrm{path}_2\}$ says to pick a point on or
below the horizontal line. The joint bottleneck constraint
($S = \{\mathrm{path}_1, \mathrm{path}_2\}$) says to pick a point on or below
the diagonal line. Clearly the only way to satisfy both
(3) & (4) is to pick some point on the diagonal, inside
the box; any such point is fair. (Separately, the
considerations in §2.2 say we should prefer the less-congested
path, and in this figure $\hat{w}_1^{TCP} > \hat{w}_2^{TCP}$ hence the loss rates
satisfy $p_1 < p_2$, hence we should prefer the right hand
side of the diagonal line.)

The following algorithm, a modification of SEMICOUPLED,
satisfies our two fairness requirements, when the
flow has two paths available. EWTCP can also be fixed
with a similar modification. The experiments in §5 show
that the modification works.


ALGORITHM

• Each ACK on subflow $r$, increase the window $w_r$ by
$\min(a/w_{total}, 1/w_r)$.

• Each loss on subflow $r$, decrease the window $w_r$ by
$w_r/2$.

Here

$$a = \hat{w}_{total}\,\frac{\max_r \hat{w}_r/\mathrm{RTT}_r^2}{\left(\sum_r \hat{w}_r/\mathrm{RTT}_r\right)^2}, \qquad (5)$$

$w_r$ is the current window size on path $r$ and $\hat{w}_r$ is the
equilibrium window size on path $r$, and similarly for
$w_{total}$ and $\hat{w}_{total}$.
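The aggressiveness parameter and the capped increase can be sketched as below. The function names are hypothetical, and the sketch plugs the current windows into formula (5) in place of the equilibrium windows, which is a simplifying assumption.

```python
def aggressiveness(windows, rtts):
    """Parameter a of formula (5), evaluated on the given windows."""
    w_total = sum(windows)
    numerator = max(w / rtt ** 2 for w, rtt in zip(windows, rtts))
    denominator = sum(w / rtt for w, rtt in zip(windows, rtts)) ** 2
    return w_total * numerator / denominator


def ack_increase(r, windows, rtts):
    """Per-ACK increase on subflow r: a / w_total, capped at 1 / w_r."""
    a = aggressiveness(windows, rtts)
    return min(a / sum(windows), 1.0 / windows[r])
```

With a single path the parameter is 1 and the rule reduces to regular TCP's 1/w. With two paths of equal window and equal RTT the parameter is 0.5, so a flow with two 10-packet windows adds 0.025 packets per ACK on each subflow.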


The increase and decrease rules are similar to SEMI- 
COUPLED, so the algorithm prefers less-congested paths. 
The difference is that the window increase is capped at 
$1/w_r$, which ensures that the multipath flow can take
no more capacity on either path than a single-path TCP 
flow would, i.e. it ensures we are inside the horizontal 
and vertical constraints in Fig.6. 

The parameter a controls the aggressiveness. Clearly 
if a is very large then the two flows act like two inde- 
pendent flows hence the equilibrium windows will be at 
the top right of the box in Fig.6. On the other hand if a 
is very small then the flows will be stuck at the bottom 
left of the box. As we said, the two fairness goals re- 
quire that we exactly hit the diagonal line. The question 
is how to find a to achieve this. 

We can calculate $a$ from the balance equations. At
equilibrium, the window increases and decreases
balance out on each path, hence

$$(1 - p_r)\,\min\!\left(\frac{a}{\hat{w}_{total}}, \frac{1}{\hat{w}_r}\right) = p_r\,\frac{\hat{w}_r}{2}.$$

Making the approximation that $p_r$ is small enough that
$1 - p_r \approx 1$, and writing it in terms of $\hat{w}_r^{TCP} = \sqrt{2/p_r}$,

$$\max\!\left(\hat{w}_r, \sqrt{\frac{\hat{w}_{total}\,\hat{w}_r}{a}}\right) = \hat{w}_r^{TCP}. \qquad (6)$$

By simultaneously solving (3) (with the inequality
replaced by equality) and (6), we arrive at (5).

Our final MPTCP algorithm, specified at the beginning
of §2, is a generalization of the above algorithm to
an arbitrary number of paths. The proof that it satisfies
(3)-(4) is in the appendix. The formula (5) technically
requires $\hat{w}_r$, the equilibrium window size, whereas in
our final algorithm we have used the instantaneous
window size instead. The experiments described below
indicate that this does not cause problems.


Trying too hard to be fair? Our fairness goals say
“take no more than a single-path TCP”. At first sight
this seems overly restrictive. For example, consider a
single-path user with a 14.4Mb/s WiFi access link, who
then adds a 2Mb/s 3G access link. Shouldn’t this user
now get 16.4Mb/s, and doesn’t the fairness goal dictate
14.4Mb/s?

Figure 7: Torus topology. We adjust the capacity of
link C, and test how well congestion is balanced.

We describe tests of this scenario, and others like it, 
in §5. MPTCP does in fact give throughput equal to the
sum of access link bandwidths, when there is no com- 
peting traffic. When there is competing traffic on the 
access links, the answer is different. 

To understand what’s going on, note that our precise 
fairness goals say “take no more than would be obtained 
by a single-path TCP experiencing the same loss rate”. 
Suppose there is no competing traffic on either link, and 
the user only takes 14.4Mb/s. Then one or other of 
the two access links is underutilized, so it has no loss, 
and a hypothetical single-path TCP with no loss should 
get very high throughput, so the fairness goal allows 
MPTCP to increase its throughput. The system will 
only reach equilibrium once both access links are fully 
utilized. See §5 for further experimental results,
including scenarios with competing traffic on the access links.


3. BALANCING CONGESTION AT A MULTIHOMED SERVER


In §3-§5 we investigate the behaviour of multipath
TCP in three different scenarios: a multihomed
Internet server, a data center, and a mobile client. Our aim
in this paper is to produce one multipath algorithm that
works robustly across a wide range of scenarios. These
three scenarios will showcase all the design decisions
discussed in §2, though not all the design decisions are
important in every one of the scenarios.


The first scenario is a multihomed Internet server.
Multihoming of important servers has become ubiquitous
over the last decade; no company reliant on network
access for their business can afford to be dependent on
a single upstream network. However, balancing traffic
across these links is difficult, as evidenced by the hoops
operators jump through using BGP techniques such as
prefix splitting and AS prepending. Such techniques are
coarse-grained, very slow, and a stress to the global
routing system. In this section we will show that multipath
transport can balance congestion, even when only a
minority of flows are multipath-capable.

Figure 8: Effect of changing the capacity of link C on
the ratio of loss rates p_C/p_A. All other links have
capacity 1000pkt/s.

Figure 9: Bursty CBR traffic on the top link requires
quick response by the multipath flow.

We will first demonstrate congestion balancing in a
simple simulation, to illustrate the design discussion in
§2 and to compare MPTCP to EWTCP and COUPLED.
In the static scenario COUPLED is better than MPTCP,
which is better than EWTCP, and in the dynamic scenario
the order is reversed; but in each case MPTCP is close to
the best, so it seems to be a reasonable compromise. We
will then validate our findings with a result from an
experimental testbed running our Linux implementation.


Static load balancing simulation. First we shall
investigate load balancing in a stable environment of
long-lived flows, testing the predictions in §2.2. Fig.7 shows
a scenario with five bottleneck links arranged in a torus,
each used by two multipath flows. All paths have equal
RTT of 100ms, and the buffers are one bandwidth-delay
product. We will adjust the capacity of link C. When
the capacity of link C is reduced it becomes
more congested, so the two flows using it should shift
their traffic towards B and D. Those links then become
more congested in turn, and this knock-on effect should
make the other flows shift their traffic onto links A and E.
With perfect balancing, the end result should be equal
congestion on all links.

Fig.8 plots the imbalance in congestion as a function
of the capacity of link C. When all links have equal
capacity (C = 1000pkt/s) then congestion is of course
perfectly balanced for all the algorithms. When link
C is smaller, the imbalance is greater. COUPLED is
very good at balancing congestion, EWTCP is bad, and
MPTCP is in between. We also find that balanced
congestion results in better fairness between total flow rates:
when link C has capacity 100pkt/s then Jain’s fairness
index is 0.99 for the flow rates with COUPLED, 0.986
for MPTCP and 0.92 for EWTCP.
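
For readers unfamiliar with it, Jain's fairness index of rates x_1, ..., x_n is (sum of x_i)^2 / (n times the sum of x_i^2). It equals 1 when all rates are equal and 1/n when one flow takes everything. The following is a minimal sketch of the computation; the function name and the sample rates are illustrative assumptions and are not the flow rates from the simulation.

```python
def jain_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(rates)
    squares = sum(x * x for x in rates)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero rate")
    total = sum(rates)
    return total * total / (n * squares)


# Illustrative inputs only (not taken from the simulation).
print(jain_index([100.0, 100.0, 100.0, 100.0, 100.0]))  # equal rates
print(jain_index([500.0, 0.0, 0.0, 0.0, 0.0]))          # one flow takes all
```

An index of 0.92, as reported for EWTCP, therefore indicates a visibly more uneven spread of total flow rates than 0.99.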

Figure 10: Server load balancing with MPTCP


Dynamic load balancing simulation. Next we illustrate
the problem with dynamic load described in §2.4.
We ran a simulation with two links as in Fig.9, both of
capacity 100Mb/s and buffer 50 packets, and one multipath
flow where each path has a 10ms RTT. On the
top link there is an additional bursty CBR flow which
sends at 100Mb/s for a random duration of mean 10ms,
then is quiet for a random duration of mean 100ms. The
multipath flow ought to use only the bottom link when
the CBR flow is present, and it ought to quickly take up
both links when the CBR flow is absent. We reasoned
that COUPLED would do badly, and the throughputs we
obtain confirm this. In Mb/s, they are
top link bottom link 


EWTCP 85 100 
MPTCP 83 99.8 
COUPLED a0 99.4 
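
As a rough yardstick for the top-link column, one can ask what an idealised multipath flow would obtain if it used the top link at full rate whenever the CBR flow is silent and not at all otherwise. The sketch below is a back-of-envelope calculation under that assumption (instant adaptation, independent on and off periods); it is not a result from the simulation, and the function name is hypothetical.

```python
LINK_MBPS = 100.0


def idle_fraction(mean_on_ms, mean_off_ms):
    """Long-run fraction of time an on/off source is silent."""
    return mean_off_ms / (mean_on_ms + mean_off_ms)


# CBR bursts last 10ms on average, quiet periods 100ms on average.
ideal_top_mbps = LINK_MBPS * idle_fraction(10.0, 100.0)
print(round(ideal_top_mbps, 1))  # 90.9
```

Against this ceiling of roughly 91Mb/s, the 85Mb/s and 83Mb/s obtained by EWTCP and MPTCP on the top link are close to what any window-based flow could hope for.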


We have found similar problems in a wide range of 
different scenarios. The exact numbers depend on how 
quickly congestion levels change, and in this illustra- 
tion we have chosen particularly abrupt changes. One 
might expect similarly abrupt changes for mobile devices
when coverage on one radio interface is suddenly
lost and then recovers.


Server load balancing experiment. We next give re- 
sults from an experimental testbed that show our Linux 
implementation of MPTCP balancing congestion, vali- 
dating the simulations we have just presented. 

We first ran a server dual-homed with two 100Mb/s 
links and a number of client machines. We used dum- 
mynet to add 10ms of latency to simulate a wide-area 
scenario. We ran 5 client machines connecting to the 
server on link 1 and 15 on link 2, both using long-lived 
flows of Linux 2.6 NewReno TCP. The first minute of 
Fig.10 shows the throughput that is achieved—clearly 
there is more congestion on link 2. Then we started 
10 multipath flows able to use both links. Perfect load 
balancing would require these new flows to shift completely
to link 1. This is not perfectly achieved, but
nonetheless multipath helps significantly to balance load,
despite constituting only 1/3 of the total number of flows.
The figure only shows MPTCP; COUPLED was simi- 
lar and EWTCP was slightly worse as it pushed more 
traffic onto link 2. 
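
The claim that perfect balancing requires the multipath flows to shift completely to link 1 can be checked with simple arithmetic. The sketch below assumes an idealised equal split of each 100Mb/s link among the flows that use it, which real TCP only approximates; the function name is hypothetical.

```python
LINK_MBPS = 100.0


def per_flow_share(flows_on_link):
    """Idealised equal share of one link, in Mb/s per flow."""
    return LINK_MBPS / flows_on_link


# Before the multipath flows start: 5 flows on link 1, 15 on link 2.
before = (per_flow_share(5), per_flow_share(15))

# If all 10 multipath flows send only on link 1, both links carry 15 flows.
after = (per_flow_share(5 + 10), per_flow_share(15))

multipath_fraction = 10 / (5 + 15 + 10)
print(before, after, multipath_fraction)
```

With all ten multipath flows on link 1 the per-flow shares on the two links become equal, which is the balanced outcome the experiment approaches but does not fully reach.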

Our second experiment used the same topology. On 
link 1 we generated Poisson arrivals of TCP flows with 
rate alternating between 10/s (light load) and 60/s (heavy 
load), with file sizes drawn from a Pareto distribution 
with mean 200KB. On link 2 we ran a single long-lived 
TCP flow. We also ran three multipath flows, one for 
each multipath algorithm. Their average throughputs 
were 61Mb/s for MPTCP, 54Mb/s for COUPLED, and 
47Mb/s for EWTCP. In heavy load EWTCP did worst 
because it did not move as much traffic onto the less con- 
gested path. In light load COUPLED did worst because 
bursts of traffic on link 1 pushed it onto link 2, where it 
remained ‘trapped’ even after link 1 cleared up.
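
To see why 10/s and 60/s correspond to light and heavy load, multiply the arrival rate by the mean file size. The sketch below does this and also draws a sample workload. It assumes 1KB = 1000 bytes and a Pareto shape parameter of 1.5; the text gives only the mean, so the shape value and all names here are assumptions made for illustration.

```python
import random

MEAN_FILE_BYTES = 200_000  # 200KB, assuming 1KB = 1000 bytes
PARETO_SHAPE = 1.5         # assumed; the text states only the mean
# Pareto scale chosen so that the distribution has the required mean.
PARETO_SCALE = MEAN_FILE_BYTES * (PARETO_SHAPE - 1) / PARETO_SHAPE


def offered_load_mbps(arrivals_per_s):
    """Mean offered load = arrival rate x mean file size."""
    return arrivals_per_s * MEAN_FILE_BYTES * 8 / 1e6


def sample_flows(arrivals_per_s, duration_s, rng):
    """Return (start_time_s, size_bytes) pairs: Poisson arrivals, Pareto sizes."""
    flows = []
    t = rng.expovariate(arrivals_per_s)
    while t < duration_s:
        flows.append((t, PARETO_SCALE * rng.paretovariate(PARETO_SHAPE)))
        t += rng.expovariate(arrivals_per_s)
    return flows


print(offered_load_mbps(10), offered_load_mbps(60))  # 16.0 96.0
flows = sample_flows(60, 10.0, random.Random(1))
```

Under these assumptions the light phase offers about 16% of link 1's 100Mb/s capacity and the heavy phase about 96%, before counting the multipath flows.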


4. EFFICIENT ROUTING 
IN DATA CENTERS 


Growth in cloud applications from companies such 
as Google, Microsoft and Amazon has resulted in huge 
data centers in which significant amounts of traffic are 
shifted between machines, rather than just out to the In- 
ternet. To support this, researchers have proposed new 
architectures with much denser interconnects than have 
traditionally been implemented. Two such proposals, 
FatTree [2] and BCube [8], are illustrated in Fig.11. The 
density of interconnects means that there are many pos- 
sible paths between any pair of machines. The challenge 
is: how can we ensure that the load is efficiently dis- 
tributed, no matter the traffic pattern? 

One obvious benefit of any sort of multipath TCP in 
data centers is that it can alleviate bottlenecks at the host 
NICs. For example in BCube, Fig.11(b), if the core is 
lightly loaded and a host has a single large flow then it 
makes sense to use both available interfaces. 

Multipath TCP is also beneficial when the network 
core is the bottleneck. To show this, we compared mul- 
tipath TCP to single-path TCP with Equal Cost Mul- 
tipath (ECMP), which we simulated by making each 
TCP source pick one of the shortest-hop paths at ran- 
dom. We ran packet-level simulations of FatTree with 
128 single-interface hosts and 80 eight-port switches, 
and for each pair of hosts we selected 8 paths at ran- 
dom to use for multipath. (Our reason for choosing 8 
paths is discussed below.) We also simulated BCube 
with 125 three-interface hosts and 25 five-port switches, 
and for each pair of hosts we selected 3 edge-disjoint 
paths according to the BCube routing algorithm, choosing
the intermediate nodes at random when the algorithm
needed a choice. All links were 100Mb/s.
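
The FatTree size used here follows from the standard construction: a FatTree built from k-port switches has k^3/4 hosts, 5k^2/4 switches, and (k/2)^2 shortest paths between hosts in different pods. The sketch below evaluates these formulas, together with the host count n^levels of a BCube built from n-port switches. The formulas are properties of the published topologies rather than statements from this section, and the function names are hypothetical.

```python
def fattree_size(k):
    """(hosts, switches, inter-pod shortest paths) for k-port switches."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even number")
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4       # k^2/4 core + k pods of k switches
    inter_pod_paths = (k // 2) ** 2  # one per core switch
    return hosts, switches, inter_pod_paths


def bcube_hosts(n, levels):
    """Hosts in a BCube with n-port switches; each host has `levels` interfaces."""
    return n ** levels


print(fattree_size(8))    # (128, 80, 16)
print(bcube_hosts(5, 3))  # 125
```

For k = 8 there are 16 shortest paths between pods, so selecting 8 at random uses half of those available to such a pair.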

Figure 11: Two proposed data center topologies. The bold lines show multiple paths between the source and

Figure 12: Multipath needs 8 paths to get good utilization in FatTree

Figure 13: Distribution of throughput and loss rate, in 128-node FatTree

We simulated three traffic patterns, all consisting of
long-lived flows. TP1 is a random permutation where
each host opens a flow to a single destination chosen
uniformly at random, such that each host has a single
incoming flow. For FatTree, this is the least amount of
traffic that can fully utilize the network and is a good
test for overall utilization. In TP2 each host opens 12
flows to 12 destinations; in FatTree the destinations are
chosen at random, while in BCube they are the host’s
neighbours in the three levels. This mimics the locality
of communication of writes in a distributed filesystem,
where replicas of a block may be placed close to each
other in the physical topology in order to allow higher
throughput. We are using a high number of replicas as
a stress-test of locality. Finally, TP3 is a sparse traffic
pattern: 30% of the hosts open one flow to a single
destination chosen uniformly at random.
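
The following sketch shows one way TP1 and TP3 could be generated. It is an illustration, not the simulator's code: the function names are hypothetical, and it assumes that no host sends to itself, which the text does not state explicitly.

```python
import random


def tp1_permutation(hosts, rng):
    """Each host sends one flow; each host receives exactly one flow."""
    while True:
        dests = hosts[:]
        rng.shuffle(dests)
        # Assumption: reject permutations in which a host sends to itself.
        if all(src != dst for src, dst in zip(hosts, dests)):
            return list(zip(hosts, dests))


def tp3_sparse(hosts, rng, fraction=0.3):
    """A fraction of the hosts each open one flow to a random other host."""
    senders = rng.sample(hosts, round(fraction * len(hosts)))
    return [(src, rng.choice([h for h in hosts if h != src])) for src in senders]


rng = random.Random(0)
hosts = list(range(128))
tp1 = tp1_permutation(hosts, rng)
tp3 = tp3_sparse(hosts, rng)
print(len(tp1), len(tp3))
```

In TP1 every host appears exactly once as a source and once as a destination, which is why it is the lightest pattern that can still load every host link.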


FatTree simulations. The per-host throughputs obtained
in FatTree, in Mb/s, are:

              TP1    TP2    TP3
SINGLE-PATH   51     94     60
EWTCP         92     92.5   99
MPTCP         95     97     99

These figures show that for all three traffic patterns,
both EWTCP and MPTCP have enough path diversity
to ‘find’ nearly all the capacity in the network, as we can
see from the fact that they get close to full utilization
of the machine’s 100Mb/s interface card. Fig.12 shows
the throughput achieved as a function of paths used, for
MPTCP under TP1; we have found that 8 is enough
to get 90% utilization, in simulations across a range of
traffic matrices and with thousands of hosts.

Average throughput figures do not give the full pic- 
ture. Fig.13 shows the distribution of throughput on 
each flow, and of loss rate on each link, obtained by 
the three algorithms, for traffic pattern TP1. We see that 
MPTCP does a better job of allocating throughput fairly 
than EWTCP, for the reasons discussed in §2.2 and §3.
Fairness matters for many datacenter distributed com- 
putations that farm processing out to many nodes and 
are limited by the response time of the slowest node. 
We also see that MPTCP does a better job of balancing 
congestion. 


BCube simulations. The per-host throughputs obtained
in BCube, in Mb/s, are:

              TP1    TP2    TP3
SINGLE-PATH   64.5   297    78
EWTCP         84     229    139
MPTCP         86.5   272    135


These throughput figures reflect three different
phenomena. First, both multipath algorithms allow a host
to use all three of its interfaces whereas single-path TCP
can use only one, so they allow higher throughput. This
is clearest in the sparse traffic pattern TP3, where the
network core is underloaded. Second, BCube paths may
have different hop counts, hence they are likely to
traverse different numbers of bottlenecks, so some paths
will be more congested than others. As discussed in
§2.2, an efficient multipath algorithm should shift its
traffic away from congestion, and EWTCP does not do
this, hence it tends to get worse throughput than MPTCP.
This is especially clear in TP2, and not noticeable in
TP3 where the core has little congestion. Third, even
MPTCP does not move all its traffic away from the
most congested path, for the reasons discussed in §2.4,
so when the least-congested paths happen to all be
shortest-hop paths then shortest-hop single-path TCP will
do better. This is what happened in TP2. (Of course it is
not always true that the least congested paths are all
shortest-hop paths, so shortest-hop single-path TCP does
poorly in other cases.)

Figure 14: A multipath flow competing against two
single-path flows

Figure 15: Multipath TCP throughput compared to
single-path, where link 1 is WiFi and link 2 is 3G.


In summary, MPTCP performs well across a wide 
range of traffic patterns. In some cases EWTCP achieves 
throughput as good as MPTCP, and in other cases it 
falls short. Even when its average throughput is as good 
as MPTCP it is less fair. 

We have compared multipath TCP to single-path TCP, 
assuming that the single path is chosen at random from 
the shortest-hop paths available. Randomization goes 
some way towards balancing traffic, but it is likely to 
cause some congestion hotspots. An alternative solu- 
tion for balancing traffic is to use a centralized scheduler 
which monitors large flows and solves an optimization 
problem to calculate good routes for them [3]. We have 
found that, in order to get comparable performance to 
MPTCP, one may need to re-run the scheduler as of- 
ten as every 100ms [22] which raises serious scalability 
concerns. However, the exact numbers depend on the 
dynamics of the traffic matrix. 


5. MULTIPATH WIRELESS CLIENT 


Modern mobile phones and devices such as Nokia’s
N900 have multiple wireless interfaces such as WiFi and
3G, yet only one of them is used for data at any given
time. With more and more applications requiring Internet
access, from email to navigation, multipath can improve
mobile users’ experience by allowing simultaneous
use of both interfaces. This shields the user from the
inherently variable connectivity of wireless networks.

Figure 16: The ratio of flow M’s throughput to the better
of flow S_1 and S_2, as we vary link 2 in Fig.14.

3G and WiFi have quite different link characteristics. 
WiFi provides much higher throughput and short RTTs, 
but in our tests its performance was very variable with 
quite high loss rates, because there was significant in- 
terference in the 2.4GHz band. 3G tends to vary over 
longer timescales, and we found it to be overbuffered 
leading to RTTs of well over a second. These differ- 
ences provide a good test of the fairness goals and RTT 
compensation algorithm developed in §2.5. The experiments
we describe here show that MPTCP gives users
at least as much throughput as single-path users, and 
that the other multipath algorithms we have described 
do worse. 


Single-flow experiment. Our first experiments use a 
laptop equipped with a 3G USB interface and an 802.11
network adapter, running our Linux implementation of 
MPTCP. The laptop was placed in the same room as the 
WiFi basestation, and 3G reception was good. The lap- 
top did not move, so the path characteristics were rea- 
sonably static. We ran 15 tests of 20 seconds each: 5 
with single-path TCP on WiFi, 5 with single-path TCP 
on 3G, and 5 with MPTCP. The average throughputs 
(with standard deviations) were 14.4 (0.2), 2.1 (0.2) and 
17.3 (0.7) Mb/s respectively. As we would wish, the 
MPTCP user gets bandwidth roughly equal to the sum 
of the bandwidths of the access links. 


Competing-flows experiment. We repeated the experiment,
but now with competing single-path TCP flows
on each of the paths, as in Fig.14. In order to showcase
our algorithm for RTT compensation we then repeated the
experiment twice more, replacing MPTCP first with EWTCP
and then with COUPLED. The former does not have any
RTT compensation built in, although the technique we
used for MPTCP could be adapted. For the latter, we
do not know how to build in RTT compensation.

Fig.15 shows the total throughput obtained by each
of the three flows over the course of 5 minutes, one plot
for each of the three multipath algorithms. The top half
of the figure shows the bandwidth achieved on the WiFi
path, the bottom half shows (inverted) the throughput
on the 3G path, and the range of the grey area extending
into both halves shows the throughput the multipath
algorithms achieved on both paths.

The figure shows that only MPTCP gives the multi- 
path flow a fair total throughput, i.e. approximately as 
good as the better of the single-path competing flows, 
which in this case is the WiFi flow. The pictures are 
somewhat choppy: it seems that the WiFi basestation is 
underbuffered, hence the TCP sawtooth leads to peaks 
and troughs in throughput as measured at the receiver; it 
also seems the 3G link has bursts of high speed, perhaps 
triggered by buffer buildup. Despite these experimental 
vicissitudes, the long-run averages show that MPTCP 
does a much better job of getting fair total throughput. 
The long-run average throughputs in Mb/s, over 5 min- 
utes of each setup, are: 


           multipath   TCP-WiFi   TCP-3G
EWTCP      1.66        3.11       1.20
COUPLED    1.41        3.49       0.97
MPTCP      2.21        2.56       0.65


These numbers match the predictions in §2.3. COUPLED
sends all its traffic on the less congested path so
it often chooses to send on the 3G path and hardly uses
the WiFi path. EWTCP splits its traffic so it gets the
average of WiFi and 3G throughput. Only MPTCP gets
close to the correct total throughput. The shortfall (2.21Mb/s
for MPTCP compared to 2.56Mb/s for the best single-path
TCP) may be due to difficulty in adapting to the
rapidly changing 3G link speed; we continue to investigate
how quickly multipath TCP should adapt to changes
in congestion.
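
The fairness goal can be expressed as a single number per run: the multipath flow's total throughput divided by that of the better competing single-path flow, which should be close to 1. The sketch below computes this ratio from the long-run averages in the table above; the dictionary layout and function name are illustrative.

```python
# Long-run average throughputs in Mb/s, copied from the table above.
results = {
    "EWTCP":   {"multipath": 1.66, "wifi": 3.11, "3g": 1.20},
    "COUPLED": {"multipath": 1.41, "wifi": 3.49, "3g": 0.97},
    "MPTCP":   {"multipath": 2.21, "wifi": 2.56, "3g": 0.65},
}


def fairness_ratio(row):
    """Multipath throughput relative to the better single-path competitor."""
    return row["multipath"] / max(row["wifi"], row["3g"])


ratios = {name: fairness_ratio(row) for name, row in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The ratios come out at roughly 0.53 for EWTCP, 0.40 for COUPLED and 0.86 for MPTCP, which quantifies the statement that only MPTCP gets close to the correct total.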


Simulations. In order to test RTT compensation across
a wider range of scenarios, we simulated the topology
in Fig.14 with two wired links, with capacities C_1 =
250pkt/s and C_2 = 500pkt/s, and propagation delays
RTT_1 = 500ms and RTT_2 = 50ms. At first sight we
might expect each flow to get 250pkt/s. The simulation
outcome is very different: flow S_1 gets 130pkt/s, flow
S_2 gets 315pkt/s and flow M gets 305pkt/s; the drop
probabilities are p_1 = 0.22% and p_2 = 0.28%. After
some thought we realize this outcome is very nearly
what we designed the algorithm to achieve. As discussed
in §2.5, flow M says ‘What would a single-path
TCP get on path 2, based on the current loss rate? I
should get at least as much!’ and decides its throughput
should be around 315pkt/s. It doesn’t say ‘What would
a single-path TCP get on path 2 if I used only path 2?’
which would give the answer 250pkt/s. The issue is that
the multipath flow does not take account of how its
actions would affect drop probabilities when it decides on
its fair rate. It is difficult to see any practical alternative.
And nonetheless, the outcome in this case is still better
for both S_1 and M than if flow M used only link 1, and
it is better for both S_2 and M than if flow M used only
link 2.

Figure 17: Throughput of multipath and regular
TCP running simultaneously over 3G and WiFi.
The 3G graph is shown inverted, so the total multipath
throughput (the grey area) can be seen clearly.

We repeated the experiment, but with C_1 = 400pkt/s,
RTT_1 = 100ms, and a range of values of C_2 (shown as
labels in Fig.16) and RTT_2 (the horizontal axis). Flow
M aims to do as well as the better of flows S_1 and S_2.
Fig.16 shows it is within a few percent of this goal in all 
cases except where the bandwidth delay product on link 
2 is very small; in such cases there are problems due to 
timeouts. Over all of these scenarios, flow M always 
gets better throughput by using multipath than if it used 
just the better of the two links; the average improvement 
is 15%. 


Mobile experiment. Having shown that our RTT com- 
pensation algorithm works in a rather testing wireless 
environment, we now wish to see how MPTCP performs 
when the client is mobile and both 3G and WiFi con- 
nectivity are intermittent. We use the same laptop and 
server as in the static experiment, but now the laptop 
user moves between floors of the building. The building 
has reasonable WiFi coverage on most floors but not on 
the staircases. 3G coverage is acceptable but is some- 
times heavily congested by other users. 

The experiment starts with one TCP running over the 
3G interface and one over WiFi, both downloading data 
from an otherwise idle university server. A multipath 
flow then starts, using both interfaces, downloading data 
from the same server. Fig.17 shows the throughputs over 
each link (each point is a 5s average). Again, WiFi is
shown above the dashed line, 3G is shown inverted below
the dashed line, and the total throughput of the multipath
flow can be clearly seen from the vertical range of
the gray region.

During the experiment the subject moves around the
building. For the first 9 minutes the 3G path has less 
congestion, so MPTCP would prefer to send its traffic 
on that route. But it also wants to get as much through- 
put as the higher-throughput path, in this case WiFi. The 
fairness algorithm prevents it from sending this much 
traffic on the 3G path, so as not to out-compete other 
single path TCPs that might be using 3G, and so the re- 
mainder is sent on WiFi. At 9 minutes the subject walks 
downstairs to go to a coffee machine. On the stairwell 
there is no WiFi coverage, but 3G coverage is better, so 
MPTCP adapts and takes advantage. When the subject 
leaves the stairwell, a new WiFi basestation is acquired, 
and multipath quickly takes advantage of it. This single 
trace shows the robustness advantage of multipath TCP, 
and it also shows that it does a good job of utilizing dif- 
ferent links simultaneously without harming competing 
traffic on those links. 


6. PROTOCOL IMPLEMENTATION 


Although this paper primarily focuses on the conges- 
tion control dynamics of MPTCP, the protocol changes 
to TCP needed to implement multipath can be quite subtle.
In particular, we must be careful to avoid deadlock
in a number of scenarios, especially relating to buffer
management and flow control. In fact we discovered 
there is little choice in many aspects of the design. There 
are also many tricky issues regarding middleboxes which 
further constrain the design, not described here. A more 
complete exposition of these constraints can be found 
in [21], and our protocol is precisely described in the 
current mptcp draft [7]. 


Subflow establishment. Our implementation of MPTCP
requires both client and server to have multipath extensions.
A TCP option in the SYN packets of the first subflow
is used to negotiate the use of multipath if both ends
support it, otherwise they fall back to regular TCP be- 
havior. After this, additional subflows can be initiated; 
a TCP option in the SYN packets of the new subflows 
allows the recipient to tie the subflow into the existing 
connection. We rely on multiple interfaces or multiple 
IP addresses to obtain different paths; we have not yet 
studied the question of when additional paths should be 
started. 


Loss Detection and Stream Reassembly. Regular TCP
uses a single sequence space for both loss detection and
reassembly of the application data stream. With MPTCP,
loss is a subflow issue, but the application data stream
spans all subflows. To accomplish both goals using a
single sequence space, the sequence space would need
to be striped across the subflows. To detect loss, the
receiver would then need to use selective acknowledgments
and the sender would need to keep a scoreboard
of which packets were sent on each subflow. Retransmitting
packets on a different subflow creates an ambiguity,
but the real problem is middleboxes that are unaware
of MPTCP traffic. For example, the pf [19] firewall
can re-write TCP sequence numbers to improve the
randomness of the initial sequence number. If only one
of the subflows passes through such a firewall, the receiver
cannot reliably reconstruct the data stream.

To avoid such issues, we separated the two roles of 
sequence numbers. The sequence numbers and cumu- 
lative ack in the TCP header are per-subflow, allowing 
efficient loss detection and fast retransmission. Then to 
permit reliable stream reassembly, an additional data se- 
quence number is added stating where in the application 
data stream the payload should be placed. 
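
The separation of the two roles can be sketched as follows. This is an illustration only, with hypothetical class and field names, and it is not the wire format or the Linux implementation. The subflow sequence number serves loss detection on its own subflow, while the data sequence number alone decides where the payload goes in the application stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    subflow: int      # which subflow carried the segment
    subflow_seq: int  # per-subflow sequence number (loss detection)
    data_seq: int     # byte offset in the application data stream
    payload: bytes


class Reassembler:
    """Orders payloads by data sequence number, whatever subflow they used."""

    def __init__(self):
        self.next_data_seq = 0
        self.pending = {}

    def receive(self, seg):
        """Store a segment and return any bytes now deliverable in order."""
        if seg.data_seq >= self.next_data_seq:
            self.pending.setdefault(seg.data_seq, seg.payload)
        out = bytearray()
        while self.next_data_seq in self.pending:
            chunk = self.pending.pop(self.next_data_seq)
            out += chunk
            self.next_data_seq += len(chunk)
        return bytes(out)


r = Reassembler()
late = r.receive(Segment(subflow=2, subflow_seq=700, data_seq=5, payload=b"world"))
both = r.receive(Segment(subflow=1, subflow_seq=10, data_seq=0, payload=b"hello"))
print(late, both)  # b'' b'helloworld'
```

Because reassembly never looks at `subflow_seq`, a middlebox that rewrites the sequence numbers of one subflow does not disturb the ordering of the data stream. The sketch ignores partially overlapping segments for brevity.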


Flow Control. TCP’s flow control is implemented via 
the combination of the receive window field and the ac- 
knowledgment field in the TCP packet header. The re- 
ceive window indicates the number of bytes beyond the 
acknowledged sequence number that the receiver can 
buffer. The sender is not permitted to send more than 
this amount of additional data. 

Multipath TCP also needs to implement flow control, 
although packets now arrive over multiple subflows. Two 
choices seem feasible: 


• separate buffer pools are maintained at the receiver
  for each subflow, and their occupancy is signalled
  relative to the subflow sequence space using the
  receive window field.

• a single buffer pool is maintained at the receiver,
  and its occupancy is signalled relative to the data
  sequence space using the receive window field.


Unfortunately the former suffers from potential dead- 
lock. Suppose subflow 1 stalls due to an outage, but 
subflow 2’s receive buffer fills up. The packets received 
from subflow 2 cannot be passed to the application because
a packet from subflow 1 is still missing, but there
is no space in subflow 2’s receive window to resend the
packet from subflow 1 that is missing. To avoid this we
use a single shared buffer; all subflows report the receive 
window relative to the last consecutively received data 
in the data sequence space. 

Does the data cumulative ack then need to be explicit, 
or can it be inferred from subflow acks by keeping track 
of which data corresponds to which subflow sequence 
numbers? 

Consider the following scenario: a receiver has sufficient
buffering for two packets (the same issue occurs with
larger buffers). In accordance with the
receive window, the sender sends two packets; data segment
1 is sent on subflow 1 with subflow sequence number
10, and data segment 2 is sent on subflow 2 with subflow
sequence number 20. The receiver acknowledges
the packets using subflow sequence numbers only; the
sender will infer which data is being acknowledged.
Initially, the inferred cumulative ack is 0.

i. In the Ack for 10, the receiver acks data 1 in order,
but the receiving application has not yet read the
data, so relative to 1, the receive window is closed to
1 packet.

ii. In the Ack for 20, the receiver acks data 2 in order.
As the application still has not read, relative to 2 the
receive window is now zero.

iii. Unfortunately the acks are reordered simply because
the RTT on path 2 is shorter than that on path 1, a
common event. The sender receives the Ack for 20,
infers that 2 has been received but 1 has not. The
data cumulative ack is therefore still 0.

iv. When the ack for 10 arrives, the sender infers that
1 and 2 have been received, so the data cumulative
ack is now 2. The receive window indicated is 1
packet, relative to the inferred cumulative ack of 2.
Thus the sender can send packet 3. Unfortunately,
the receiver cannot buffer 3 and must drop it.

In general, the problem is that although it is possible
to infer a data cumulative ack from the subflow acks,
it is not possible to reliably infer the trailing edge of
the receive window. The result is either missed sending
opportunities or dropped packets. This is not a corner
case; it will occur whenever RTTs differ so as to cause
the acks to arrive in a different order from that in which
they were sent.

To avoid this problem (and some others related to 
middleboxes) we add an explicit data acknowledgment 
field in addition to the subflow acknowledgment field in 
the TCP header. 
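
The reordering scenario above can be replayed in a few lines. The sketch below is a simplified model with hypothetical function names, counting in whole packets. The first function mimics a sender that infers the data cumulative ack from subflow acks; the second mimics a sender that reads an explicit data ack carried next to the window.

```python
def inferred_state(acks, subflow_to_data):
    """Sender infers the data cumulative ack from subflow acks alone.

    acks: (subflow_seq, window) pairs in the order they reach the sender.
    Returns (inferred cumulative ack, highest data segment it believes
    it may send).
    """
    acked, cum, edge = set(), 0, 0
    for subflow_seq, window in acks:
        acked.add(subflow_to_data[subflow_seq])
        while cum + 1 in acked:
            cum += 1
        edge = cum + window  # window applied to the inferred ack
    return cum, edge


def explicit_state(acks):
    """Each ack carries (data cumulative ack, window) explicitly."""
    cum, edge = 0, 0
    for data_ack, window in acks:
        if data_ack >= cum:  # ignore acks that arrive stale
            cum = data_ack
            edge = max(edge, data_ack + window)
    return cum, edge


mapping = {10: 1, 20: 2}        # subflow sequence number -> data segment
reordered = [(20, 0), (10, 1)]  # the Ack for 20 overtakes the Ack for 10
print(inferred_state(reordered, mapping))  # (2, 3)
print(explicit_state([(2, 0), (1, 1)]))    # (2, 2)
```

With inference the sender believes it may send data segment 3 although the receiver's two-packet buffer is full; with an explicit data ack the right edge stays at 2 regardless of the order in which the acks arrive.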


Encoding. How should data sequence numbers and
data acknowledgments be encoded in TCP packets? Two
mechanisms seemed feasible: carry them in TCP op- 
tions or embed them in the payload using an SSL-like 
chunking mechanism. For data sequence numbers there 
is no compelling reason to choose one or the other, but 
for data acknowledgements the situation is more com- 
plex. 

For the sake of concreteness, let us assume that a hy- 
pothetical payload encoding uses a chunked TLV struc- 
ture, and that a data ack is contained in its own chunk, 
interleaved with data chunks flowing in the same direc- 
tion. As data acks are now part of the data stream, they 
are subject to congestion control and flow control. This 
can lead to potential deadlock scenarios. 

Consider a scenario where A’s receive buffer is full
because the application has not read the data, but A’s
application wishes to send data to B whose receive buffer
is empty. This might occur for example when B is
pipelining requests to A, and A now needs to send the response
to an earlier request to B before reading the next request.

A sends its data, B stores it locally, and wants to send
the data ACK, but can’t do so: flow control imposed by
A’s receive window stops it. Because no data acks are
received from B, A cannot free its send buffer, so this
fills up and blocks the sending application on A. The
connection is now deadlocked. A’s application will only
read when it has finished sending data to B, but it cannot
do so because its send buffer is full. The send buffer can
only empty when A receives a data ack from B, but B
cannot send a data ack until A’s application reads. This
is a classic deadlock cycle.

In general, flow control of acks seems to be danger- 
ous. Our implementation conveys data acks using TCP 
options to avoid this and similar issues. Given this choice, 
we also encode data sequence numbers in TCP options. 


7. RELATED WORK 


There has been a good deal of work on building mul- 
tipath transport protocols [13, 27, 18, 12, 14, 6, 23, 7]. 
Most of this work focuses on the protocol mechanisms 
needed to implement multipath transmission, with key 
goals being robustness to long term path failures and to 
short term variations in conditions on the paths. The 
main issues are what we discussed in §6: how to split
sequence numbers across paths (i.e. whether to use one 
sequence space for all subflows or one per subflow with 
an extra connection-level sequence number), how to do 
flow control (subflow, connection level or both), how to 
ack, and so forth. Our protocol design in §6 has drawn
on this literature. 

However, the main focus of this paper is congestion 
control, not protocol design. In most existing proposals,
the problem of shared bottlenecks (§2.1) is considered
but the other issues in §2 are not. Let us highlight the
congestion control characteristics of these proposals. 

pTCP [12], CMT over SCTP [14] and M/TCP [23] use
uncoupled congestion control on each path, and are not 
fair to competing single-path traffic in the general case. 

mTCP [27] also performs uncoupled congestion con- 
trol on each path. In an attempt to detect shared conges- 
tion at bottlenecks it computes the correlation between 
fast retransmit intervals on different subflows. It is not 
clear how robust this detector is. 

R-MTP [18] targets wireless links: it probes the band- 
width available periodically for each subflow and ad- 
justs the rates accordingly. To detect congestion it uses 
packet interarrival times and jitter, and infers mounting 
congestion when it observes increased jitter. This only 
works when wireless links are the bottleneck. 

The work in [11] is based on using EWTCP with different
weights on each path, and adapting the weights to
achieve the outcomes described in §2.1–§2.2. It does not
address the problems identified in §2.3–§2.5, and in
particular it has problems coping with heterogeneous RTTs.


Network layer multipath. ECMP [25] achieves load
balancing at the flow level, without the involvement of 
end-systems. It sends all packets from a given flow 
along the same route in order that end-systems should 
not see any packet re-ordering. ECMP and multipath 
TCP complement each other. Multipath TCP can use 
ECMP to get different paths through the network with- 
out having multihomed endpoints. Different subflows of 
the same multipath connection will have different five- 
tuples (at least one port will differ) and will likely hash 
onto a different path with ECMP. This interaction can 
be readily used in data centers, where multiple paths are 
available and ECMP is widely used. 

Horizon [20] is a system for load balancing at the
network layer, for wireless mesh networks. Horizon
network nodes maintain congestion state and estimated
delay for each possible path towards the destination;
hop-by-hop backpressure is applied to achieve near-optimal
throughput, and the delay estimates let it avoid re-ordering.


Theoretical work suggests that inefficient outcomes may 
arise when both the end-systems and the network partic- 
ipate in balancing traffic [1]. 


Application layer multipath. BitTorrent [4] is an ex- 
ample of application layer multipath. Different chunks 
of the same file are downloaded from different peers to 
increase throughput. BitTorrent works at chunk granu- 
larity, and only optimizes for throughput, downloading 
more chunks from faster servers. Essentially BitTorrent 
is behaving in a similar way to uncoupled multipath con- 
gestion control, albeit with the paths having different 
endpoints. While uncoupled congestion control does not 
balance flow rates, it nevertheless achieves some degree 
of load balancing when we take into account flow sizes 
[17, 26], by virtue of the fact that the less congested sub- 
flow gets higher throughput and therefore fewer bytes 
are put on the more congested subflow. 


8. CONCLUSIONS & FUTURE WORK 


We have demonstrated a working multipath conges- 
tion control algorithm. It brings immediate practical 
benefits: in §5 we saw it seamlessly balance traffic over 
3G and WiFi radio links, as signal strength faded in and 
out. It is safe to use: the fairness mechanism from §2.5 
ensures that it does not harm other traffic, and that there 
is always an incentive to turn it on because its aggregate 
throughput is at least as good as would be achieved on 
the best of its available paths. It should be beneficial 
to the operation of the Internet, since it selects efficient 
paths and balances congestion, as described in §2.2 and
demonstrated in §3, at least in so far as it can given
topological constraints and the requirements of fairness.

We believe our multipath congestion control algorithm 
is safe to deploy, either as part of the IETF’s efforts to 
standardize Multipath TCP[7] or with SCTP, and it will 
perform well. This is timely, as the rise of multipath- 
capable smart phones and similar devices has made it 
crucial to find a good way to use multiple interfaces 
more effectively. Currently such devices use heuristics 
to periodically choose the best interface, terminating ex- 
isting connections and re-establishing new ones each time 
a switch is made. Combined with a transport protocol 
such as Multipath TCP or SCTP, our congestion control 
mechanism avoids the need to make such binary deci- 
sions, but instead allows continuous and rapid rebalanc- 
ing on short timescales as wireless conditions change. 

Our congestion control scheme is designed to be com- 
patible with existing TCP behavior. However, existing 
TCP has well-known limitations when coping with long 
high-speed paths. To this end, Microsoft incorporate 
Compound TCP[24] in Vista and Windows 7, although 
it is not enabled by default, and recent Linux kernels 
use Cubic TCP[9]. We believe that Compound TCP 
should be a very good match for our congestion con- 
trol algorithm. Compound TCP kicks in when a link 
is underutilized to rapidly fill the pipe, but it falls back 
to NewReno-like behavior once a queue starts to build. 
Such a delay-based mechanism would be complementary
to the work described in this paper, but would further
improve a multipath TCP’s ability to switch to a
previously congested path that suddenly has spare
capacity. We intend to investigate this in future work.


Appendix


We now prove that the equilibrium window sizes of MPTCP 
satisfy the fairness goals in 82.5. The rough intuition is 
that if we use SEMICOUPLED from 82.4, and addition- 
ally ensure (4), then the set of bottlenecked paths in- 
creases as a increases. The proof involves identifying 
the order in which paths become bottlenecked, to permit 

an analysis similar to §2.5. 

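First define

$$ i(S) = \frac{\max_{r \in S} \sqrt{\hat{w}_r} / \mathrm{RTT}_r}{\sum_{r \in S} \hat{w}_r / \mathrm{RTT}_r} $$

and assume for convenience that the window sizes are
kept in the order

$$ \sqrt{\hat{w}_1} / \mathrm{RTT}_1 \le \sqrt{\hat{w}_2} / \mathrm{RTT}_2 \le \cdots \le \sqrt{\hat{w}_n} / \mathrm{RTT}_n. $$
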
Note that with this ordering, the equilibrium window in- 
crease (1) reduces to 


Wmax($) [PTT (8 
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i.e. it can be computed with a linear search not a combi- 
natorial search. 
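
The remark about linear search can be checked numerically. The sketch below is not the authors' code. It assumes that the per-ACK increase (1) on path r is the minimum, over all path sets S containing r, of max over s in S of w_s/RTT_s^2 divided by the square of the sum over s in S of w_s/RTT_s, and that paths are sorted by increasing w_s/RTT_s^2. The function names are hypothetical. It compares an exhaustive search over subsets with a scan over prefixes of the sorted path list.

```python
from itertools import combinations


def increase_brute_force(w, rtt, r):
    """Minimum of the increase term over every subset S that contains path r."""
    others = [s for s in range(len(w)) if s != r]
    best = float("inf")
    for k in range(len(others) + 1):
        for rest in combinations(others, k):
            subset = (r,) + rest
            numerator = max(w[s] / rtt[s] ** 2 for s in subset)
            denominator = sum(w[s] / rtt[s] for s in subset) ** 2
            best = min(best, numerator / denominator)
    return best


def increase_linear(w, rtt, r):
    """Same minimum, found by scanning prefixes of the sorted path list.

    After sorting by w/RTT^2, the best subset whose largest element is the
    path at position u is the whole prefix up to u, so only the prefixes
    that reach path r need to be examined.
    """
    order = sorted(range(len(w)), key=lambda s: w[s] / rtt[s] ** 2)
    position = order.index(r)
    best = float("inf")
    running_sum = 0.0
    for k, s in enumerate(order):
        running_sum += w[s] / rtt[s]
        if k >= position:
            best = min(best, (w[s] / rtt[s] ** 2) / running_sum ** 2)
    return best
```

With a single path both functions return 1/w, the familiar per-ACK increase of regular TCP. The scan itself is linear in the number of paths; the sort adds a logarithmic factor unless the windows are already kept in order, as the text assumes.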

At equilibrium, assuming drop probabilities are small
so $1 - p_r \approx 1$, the window sizes satisfy the balance
equations

$$ \min_{S : r \in S} i(S)^2 = p_r \hat{w}_r / 2 \quad \text{for each } r \in R. $$


Rearranging this, and writing it in terms of $\hat{w}_r^{\mathrm{TCP}} = \sqrt{2/p_r}$,

$$ \hat{w}_r^{\mathrm{TCP}} = \sqrt{\hat{w}_r} \, \max_{S : r \in S} 1/i(S). \qquad (7) $$


Now take any $T \subseteq R$. Rearranging the definition of
$i(T)$, and applying some simple algebra, and substituting
in (7),

Since $T$ was arbitrary, this proves we satisfy (4).
To prove (3), applying (7) at r = n in conjunction 
with the ordering on window sizes, we get 


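$$ \hat{w}_n^{\mathrm{TCP}} = \mathrm{RTT}_n \sum_{r \in R} \hat{w}_r / \mathrm{RTT}_r $$

or

$$ \sum_{r \in R} \hat{w}_r / \mathrm{RTT}_r = \hat{w}_n^{\mathrm{TCP}} / \mathrm{RTT}_n. $$
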
One can also show that for all $r$, $\hat{w}_r^{\mathrm{TCP}} / \mathrm{RTT}_r \le \hat{w}_n^{\mathrm{TCP}} / \mathrm{RTT}_n$;
the proof is by induction on $r$ starting at $r = n$, and is
omitted. These two facts imply (3).


Abstract


This paper introduces CIEL, a universal execution en- 
gine for distributed data-flow programs. Like previous 
execution engines, CIEL masks the complexity of dis- 
tributed programming. Unlike those systems, a CIEL job 
can make data-dependent control-flow decisions, which 
enables it to compute iterative and recursive algorithms. 

We have also developed Skywriting, a Turing- 
complete scripting language that runs directly on CIEL. 
The execution engine provides transparent fault toler- 
ance and distribution to Skywriting scripts and high- 
performance code written in other programming lan- 
guages. We have deployed CIEL on a cloud computing 
platform, and demonstrate that it achieves scalable per- 
formance for both iterative and non-iterative algorithms. 


1 Introduction 


Many organisations have an increasing need to process 
large data sets, and a cluster of commodity machines on 
which to process them. Distributed execution engines— 
such as MapReduce [18] and Dryad [26]—have become 
popular systems for exploiting such clusters. These sys- 
tems expose a simple programming model, and auto- 
matically handle the difficult aspects of distributed com- 
puting: fault tolerance, scheduling, synchronisation and 
communication. MapReduce and Dryad can be used to 
implement a wide range of algorithms [3, 39], but they 
are awkward or inefficient for others [12, 21, 25, 28, 34]. 
The problems typically arise with iterative algorithms, 
which underlie many machine-learning and optimisation 
problems, but require a more expressive programming 
model and a more powerful execution engine. To address 
these limitations, and extend the benefits of distributed 
execution engines to a wider range of applications, we 
have developed Skywriting and CIEL. 

Skywriting is a scripting language that allows the 
straightforward expression of iterative and recursive 


task-parallel algorithms using imperative and functional 
language syntax [31]. Skywriting scripts run on CIEL, 
an execution engine that provides a universal execu- 
tion model for distributed data-flow. Like previous sys- 
tems, CIEL coordinates the distributed execution of a set 
of data-parallel tasks arranged according to a data-flow 
DAG, and hence benefits from transparent scaling and 
fault tolerance. However CIEL extends previous mod- 
els by dynamically building the DAG as tasks execute. 
As we will show, this conceptually simple extension— 
allowing tasks to create further tasks—enables CIEL to 
support data-dependent iterative or recursive algorithms. 
We present the high-level architecture of CIEL in Sec- 
tion 3, and explain how Skywriting maps onto CIEL’s 
primitives in Section 4. 

Our implementation incorporates several additional 
features, described in Section 5. Like existing systems, 
CIEL provides transparent fault tolerance for worker 
nodes. Moreover, CIEL can tolerate failures of the cluster 
master and the client program. To improve resource util- 
isation and reduce execution latency, CIEL can memoise 
the results of tasks. Finally, CIEL supports the streaming 
of data between concurrently-executing tasks. 

We have implemented a variety of applications in 
Skywriting, including MapReduce-style (grep, word- 
count), iterative (k-means, PageRank) and dynamic- 
programming (Smith-Waterman, option pricing) algo- 
rithms. In Section 6 we evaluate the performance of 
some of these applications when run on a CIEL cluster. 


2 Motivation 


Several researchers have identified limitations in the
MapReduce and Dryad programming models. These
systems were originally developed for batch-oriented
jobs, namely large-scale text mining for information
retrieval [18, 26]. They are designed to maximise
throughput, rather than minimise individual job latency.
This is especially noticeable in iterative computations,
for which multiple jobs are chained together and the job
latency is multiplied [12, 21, 25, 28, 34].


Feature               MapReduce        Dryad        Pregel       Iterative MR     Piccolo          CIEL
                      [2, 18]          [26]         [28]         [12, 21]         [34]
Dynamic control flow  ✗                ✗            ✓            ✓                ✓                ✓
Task dependencies     Fixed (2-stage)  Fixed (DAG)  Fixed (BSP)  Fixed (2-stage)  Fixed (1-stage)  Dynamic
Fault tolerance       Transparent      Transparent  Transparent  ✗                Checkpoint       Transparent
Data locality         ✓                ✓            ✓            ✓                ✓                ✓
Transparent scaling   ✓                ✓            ✓            ✓                ✗                ✓

Figure 1: Analysis of the features provided by existing distributed execution engines.

Nevertheless, MapReduce—in particular its open- 
source implementation, Hadoop [2]—remains a pop- 
ular platform for parallel iterative computations with 
large inputs. For example, the Apache Mahout ma- 
chine learning library uses Hadoop as its execution en- 
gine [3]. Several of the Mahout algorithms—such as 
k-means clustering and singular value decomposition— 
are iterative, comprising a data-parallel kernel inside a 
while-not-converged loop. Mahout uses a driver pro- 
gram that submits multiple jobs to Hadoop and performs 
convergence testing at the client. However, since the 
driver program executes logically (and often physically) 
outside the Hadoop cluster, each iteration incurs job- 
submission overhead, and the driver program does not 
benefit from transparent fault tolerance. These problems 
are not unique to Hadoop, but are shared with both the 
original version of MapReduce [18] and Dryad [26]. 

The computational power of a distributed execution 
engine is determined by the data flow that it can express. 
In MapReduce, the data flow is limited to a bipartite 
graph parameterised by the number of map and reduce 
tasks; Dryad allows data flow to follow a more general 
directed acyclic graph (DAG), but it must be fully spec- 
ified before starting the job. In general, to support it- 
erative or recursive algorithms within a single job, we 
need data-dependent control flow, i.e. the ability to
create more work dynamically, based on the results of
previous computations. At the same time, we wish to retain
the existing benefits of task-level parallelism: transparent 
fault tolerance, locality-based scheduling and transparent 
scaling. In Figure 1, we analyse a range of existing sys- 
tems in terms of these objectives. 

MapReduce and Dryad already support transparent 
fault tolerance, locality-based scheduling and transparent 
scaling [18, 26]. In addition, Dryad supports arbitrary 
task dependencies, which enables it to execute a larger 
class of computations than MapReduce. However, nei- 
ther supports data-dependent control flow, so the work in 
each computation must be statically pre-determined. 

A variety of systems provide data-dependent control 
flow but sacrifice other functionality. Google’s Pregel 
is the largest-scale example of a distributed execution 
engine with support for control flow [28]. Pregel is a 
Bulk Synchronous Parallel (BSP) system designed for 
executing graph algorithms (such as PageRank), and 
Pregel computations are divided into “supersteps”, dur- 
ing which a “vertex method” is executed for each vertex 
in the graph. Crucially, each vertex can vote to terminate 
the computation, and the computation terminates when 
all vertices vote to terminate. Like a simple MapRe- 
duce job, however, a Pregel computation only operates 
on a single data set, and the programming model does 
not support the composition of multiple computations. 

Two recent systems add iteration capabilities to 
MapReduce. CGL-MapReduce is a new implementation 
of MapReduce that caches static (loop-invariant) data in 
RAM across several MapReduce jobs [21]. HaLoop ex- 
tends Hadoop with the ability to evaluate a convergence 
function on reduce outputs [12]. Neither system provides 
fault tolerance across multiple iterations, and neither can 
support Dryad-style task dependency graphs. 

Finally, Piccolo is a new programming model for data- 
parallel programming that uses a partitioned in-memory 
key-value table to replace the reduce phase of MapRe- 
duce [34]. A Piccolo program is divided into “kernel” 
functions, which are applied to table partitions in paral- 
lel, and typically write key-value pairs into one or more 
other tables. A “control” function coordinates the kernel 
functions, and it may perform arbitrary data-dependent 
control flow. Piccolo supports user-assisted checkpoint- 
ing (based on the Chandy-Lamport algorithm), and is 
limited to fixed cluster membership. If a single machine 
fails, the entire computation must be restarted from a 
checkpoint with the same number of machines. 

We believe that CIEL is the first system to support all 
five goals in Figure 1, but it is not a panacea. CIEL 
is designed for coarse-grained parallelism across large 
data sets, as are MapReduce and Dryad. For fine-grained 
tasks, a work-stealing scheme is more appropriate [11]. 
Where the entire data set can fit in RAM, Piccolo may 
be more efficient, because it can avoid writing to disk. 
Ultimately, achieving the highest performance requires 
significant developer effort, using a low-level technique 
such as explicit message passing [30]. 


Figure 2: A CIEL job is represented by a dynamic task graph, which contains tasks and objects (§3.1). In this example,
root task A spawns tasks B, C and D, and delegates the production of its result to D. Internally, CIEL uses task and
object tables to represent the graph (§3.3).


3 CIEL 


CIEL is a distributed execution engine that can execute 
programs with arbitrary data-dependent control flow. In 
this section, we first describe the core abstraction that 
CIEL supports: the dynamic task graph (§3.1). We then 
describe how CIEL executes a job that is represented as 
a dynamic task graph (§3.2). Finally, we describe the 
concrete architecture of a CIEL cluster that is used for 
distributed data-flow computing (§3.3). 


3.1 Dynamic task graphs 


In this subsection, we define the three CIEL primitives— 
objects, references and tasks—and explain how they are 
related in a dynamic task graph (Figure 2). 

CIEL is a data-centric execution engine: the goal of 
a CIEL job is to produce one or more output objects. 
An object is an unstructured, finite-length sequence of 
bytes. Every object has a unique name: if two objects 
exist with the same name, they must have the same con- 
tents. To simplify consistency and replication, an object 
is immutable once it has been written, but it is sometimes 
possible to append to an object (§5.3). 

It is helpful to be able to describe an object without 
possessing its full contents; CIEL uses references for this 
purpose. A reference comprises a name and a set of lo- 
cations (e.g. hostname-port pairs) where the object with 
that name is stored. The set of locations may be empty: 
in that case, the reference is a future reference to an ob- 
ject that has not yet been produced. Otherwise, it is a 
concrete reference, which may be consumed. 

A CIEL job makes progress by executing tasks. A 
task is a non-blocking atomic computation that executes 
completely on a single machine. A task has one or more
dependencies, which are represented by references, and
the task becomes runnable when all of its dependencies
become concrete. The dependencies include a special
object that specifies the behaviour of the task (such as an
executable binary or a Java class) and may impose some
structure over the other dependencies. To simplify fault
tolerance (§5.2), CIEL requires that all tasks compute a
deterministic function of their dependencies. A task also 
has one or more expected outputs, which are the names of 
objects that the task will either create or delegate another 
task to create. 

Tasks can have two externally-observable behaviours. 
First, a task can publish one or more objects, by cre- 
ating a concrete reference for those objects. In particu- 
lar, the task can publish objects for its expected outputs, 
which may cause other tasks to become runnable if they 
depend on those outputs. Second, to support data-dependent
control flow, a task may spawn new tasks that
perform additional computation. CIEL enforces the
following conditions on task behaviour:


1. For each of its expected outputs, a task must either 
publish a concrete reference, or spawn a child task 
with that name as an expected output. This ensures 
that, as long as the children eventually terminate, 
any task that depends on the parent’s output will 
eventually become runnable. 


2. A child task must only depend on concrete refer- 
ences (i.e. objects that already exist) or future refer- 
ences to the outputs of tasks that have already been 
spawned (i.e. objects that are already expected to be 
published). This prevents deadlock, as a cycle can- 
not form in the dependency graph. 
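

The two conditions can be read as checks that a master could apply when a task reports its result. The following Python sketch is illustrative only: the Task class, the check_task_result function and the idea of returning a list of violations are assumptions made for this example, not part of CIEL's interface.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor that records object names only."""
    name: str
    dependencies: tuple      # names of the objects the task reads
    expected_outputs: tuple  # names the task must publish or delegate


def check_task_result(task, published, spawned, available):
    """Return a list of violations of the two conditions (empty if none).

    published -- set of object names that the finished task published
    spawned   -- child Task objects, in the order they were spawned
    available -- names that were concrete, or already promised by other
                 previously spawned tasks, before this task finished
    """
    violations = []

    # Condition 1: each expected output is published or delegated to a child.
    delegated = {out for child in spawned for out in child.expected_outputs}
    for out in task.expected_outputs:
        if out not in published and out not in delegated:
            violations.append(
                f"{task.name}: output {out!r} neither published nor delegated")

    # Condition 2: a child may depend only on names that exist or are already
    # promised. Outputs of earlier siblings count as already promised.
    known = set(available) | set(published)
    for child in spawned:
        for dep in child.dependencies:
            if dep not in known:
                violations.append(
                    f"{child.name}: dependency {dep!r} is not yet known")
        known |= set(child.expected_outputs)

    return violations
```

For instance, with hypothetical object names, a root task A that expects output "z" may spawn B (producing "x"), C (producing "y") and then D (depending on "x" and "y", producing "z"). Spawning D before B and C would be reported as a violation of the second condition, because "x" and "y" would not yet be promised by any task.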


The dynamic task graph stores the relation between 
tasks and objects. An edge from an object to a task means 
that the task depends on that object. An edge from a task 
to an object means that the task is expected to output 
the object. As a job runs, new tasks are added to the 
dynamic task graph, and the edges are rewritten when a 
newly-spawned task is expected to produce an object. 

The dynamic task graph provides low-level data- 
dependent control flow that resembles tail recursion: a 
task either produces its output (analogous to returning a 
value) or spawns a new task to produce that output (anal- 
ogous to a tail call). It also provides facilities for data- 
parallelism, since independent tasks can be dispatched 
in parallel. However, we do not expect programmers 
to construct dynamic task graphs manually, and instead 
we provide the Skywriting scripting language for generating 
these graphs programmatically (§4). 


3.2 Evaluating objects 


Given a dynamic task graph, the role of CIEL is to eval- 
uate one or more objects that correspond to the job out- 
puts. Indeed, a CIEL job can be specified as a single 
root task that has only concrete dependencies, and an 
expected output that names the final result of the com- 
putation. This leads to two natural strategies, which are 
variants of topological sorting: 


Eager evaluation. Since the task dependencies form a 
DAG, at least one task must have only concrete de- 
pendencies. Start by executing the tasks with only 
concrete dependencies; subsequently execute tasks 
when all of their dependencies become concrete. 


Lazy evaluation. Seek to evaluate the expected output 
of the root task. To evaluate an object, identify the 
task, T, that is expected to produce the object. If T
has only concrete dependencies, execute it
immediately; otherwise, block T and recursively evaluate
all of its unfulfilled dependencies using the same 
procedure. When the inputs of a blocked task be- 
come concrete, execute it. When the production of 
a required object is delegated to a spawned task, re- 
evaluate that object. 


When we first developed CIEL, we experimented with 
both strategies, but switched exclusively to lazy evalua- 
tion since it more naturally supports the fault-tolerance 
and memoisation features that we describe in §5. 
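

A minimal single-process sketch of lazy evaluation follows. It is not CIEL's scheduler: the Task class, the evaluate function, the store and producers dictionaries and the halve_until_small example are all assumptions chosen for illustration, and recursion stands in for blocking a task until its inputs become concrete.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    dependencies: tuple      # names of input objects
    expected_outputs: tuple  # names of objects to publish or delegate
    run: Callable            # run(inputs) -> (published dict, spawned list)


def evaluate(name, store, producers):
    """Lazily evaluate the object called `name`.

    store     -- maps names of concrete objects to their values
    producers -- maps object names to the task expected to produce them
    """
    while name not in store:
        task = producers[name]
        # Evaluate unfulfilled dependencies first.
        for dep in task.dependencies:
            evaluate(dep, store, producers)
        inputs = {dep: store[dep] for dep in task.dependencies}
        published, spawned = task.run(inputs)
        store.update(published)
        # Delegation: a spawned child becomes the producer of its outputs.
        for child in spawned:
            for out in child.expected_outputs:
                producers[out] = child
        if name not in store and producers[name] is task:
            raise RuntimeError(f"{name!r} was neither published nor delegated")
    return store[name]


def halve_until_small(dep, k):
    """Task that halves its input until it drops below 1.0."""
    def run(inputs):
        x = inputs[dep]
        if x < 1.0:
            return {"result": x}, []          # publish the final result
        nxt = f"x{k + 1}"
        # Publish an intermediate object and delegate "result" to a child.
        return {nxt: x / 2}, [halve_until_small(nxt, k + 1)]
    return Task((dep,), ("result",), run)


store = {"x0": 10.0}
producers = {"result": halve_until_small("x0", 0)}
print(evaluate("result", store, producers))  # prints 0.625
```

The loop in evaluate corresponds to re-evaluating an object whose production has been delegated: each time the current producer spawns a child for "result", the next iteration looks up the new producer. The number of iterations depends on the data, which is the kind of control flow a static DAG cannot express.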


3.3 System architecture 


Figure 3 shows the architecture of a CIEL cluster. A sin- 
gle master coordinates the end-to-end execution of jobs, 
and several workers execute individual tasks. 

The master maintains the current state of the dynamic
task graph in the object table and task table (Figure 2(b)).
Each row in the object table contains the latest reference
for that object, including its locations (if any), and
a pointer to the task that is expected to produce it (if any:
an object will not have a task pointer if it is loaded into
the cluster by an external tool). Each row in the task
table corresponds to a spawned task, and contains pointers
to the references on which the task depends.


Figure 3: A CIEL cluster has a single master and many
workers. The master dispatches tasks to the workers for
execution. After a task completes, the worker publishes
a set of objects and may spawn further tasks.


The master scheduler is responsible for making 
progress in a CIEL computation: it lazily evaluates out- 
put objects and pairs runnable tasks with idle workers. 
Since task inputs and outputs may be very large (on the 
order of gigabytes per task), all bulk data is stored on the 
workers themselves, and the master handles references. 
The master uses a multiple-queue-based scheduler (de- 
rived from Hadoop [2]) to dispatch tasks to the worker 
nearest the data. If a worker needs to fetch a remote ob- 
ject, it reads the object directly from another worker. 

The workers execute tasks and store objects. At 
startup, a worker registers with the master, and periodi- 
cally sends a heartbeat to demonstrate its continued avail- 
ability. When a task is dispatched to a worker, the ap- 
propriate executor is invoked. An executor is a generic 
component that prepares input data for consumption and 
invokes some computation on it, typically by executing 
an external process. We have implemented simple execu- 
tors for Java, .NET, shell-based and native code, as well 
as a more complex executor for Skywriting (§4). 

Assuming that a worker executes a task successfully, 
it will reply to the master with the set of references that 
it wishes to publish, and a list of task descriptors for any 
new tasks that it wishes to spawn. The master will then 
update the object table and task table, and re-evaluate the 
set of tasks now runnable. 

In addition to the master and workers, there will be one 
or more clients (not shown). A client’s role is minimal: it 
submits a job to the master, and either polls the master to 
discover the job status or blocks until the job completes. 


    function process_chunk(chunk, prev_result) {
        // Execute native code for chunk processing.
        // Returns a reference to a partial result.
        return spawn_exec(...);
    }

    function is_converged(curr_result, prev_result) {
        // Execute native code for convergence test.
        // Returns a reference to a boolean.
        return spawn_exec(...)[0];
    }

    input_data = [ref("ciel://host137/chunk0"),
                  ref("ciel://host223/chunk1"),
                  ...];

    curr = ...; // Initial guess at the result.

    do {
        prev = curr;
        curr = [];
        for (chunk in input_data) {
            curr += process_chunk(chunk, prev);
        }
    } while (!*is_converged(curr, prev));

    return curr;


Figure 4: Iterative computation implemented in Skywriting.
input_data is a list of n input chunks, and curr is
initialised to a list of n partial results.


A job submission message contains a root task, which 
must have only concrete dependencies. The master adds 
the root task to the task table, and starts the job by lazily 
evaluating its output (§3.2). 

Note that CIEL currently uses a single (active) mas- 
ter for simplicity. Despite this, our implementation can 
recover from master failure (§5.2), and it did not cause 
a performance bottleneck during our evaluation (§6). 
Nonetheless, if it became a concern in future, it would be 
possible to partition the master state—i.e. the task table 
and object table—between several hosts, while retaining 
the functionality of a single logical master. 


4 Skywriting 


Skywriting is a language for expressing task-level paral- 
lelism that runs on top of CIEL. Skywriting is Turing- 
complete, and can express arbitrary data-dependent con- 
trol flow using constructs such as while loops and re- 
cursive functions. Figure 4 shows an example Skywrit- 
ing script that computes an iterative algorithm; we use a 
similar structure in the k-means experiment (§6.2). 

We introduced Skywriting in a previous paper [31], 
but briefly restate the key features here: 


• ref(url) returns a reference to the data stored
  at the given URL. The function supports common
  URL schemes, and the custom ciel scheme, which
  accesses entries in the CIEL object table. If the URL
  is external, CIEL downloads the data into the cluster
  as an object, and assigns a name for the object.

• spawn(f, [arg, ...]) spawns a parallel task
  to evaluate f(arg, ...). Skywriting functions
  are pure: functions cannot have side-effects, and all
  arguments are passed by value. The return value is
  a reference to the result of f(arg, ...).

• exec(executor, args, n) synchronously runs
  the named executor with the given args. The
  executor will produce n outputs. The return value is a
  list of n references to those outputs.

• spawn_exec(executor, args, n) spawns a
  parallel task to run the named executor with the
  given args. As with exec(), the return value is a
  list of n references to those outputs.

• The dereference (unary-*) operator can be applied
  to any reference; it loads the referenced data into
  the Skywriting execution context, and evaluates to
  the resulting data structure.


Figure 5: Task creation in Skywriting. Tasks can be
created using (a) spawn(), (b) spawn_exec() and (c) the
dereference (*) operator.


In the following, we describe how Skywriting maps onto
CIEL primitives. We describe how tasks are created
(§4.1), how references are used to facilitate
data-dependent control flow (§4.2), and the relationship
between Skywriting and other frameworks (§4.3).


4.1 Creating tasks 


The distinctive feature of Skywriting is its ability to 
spawn new tasks in the middle of executing a job. The 
language provides two explicit mechanisms for spawning 
new tasks (the spawn() and spawn_exec() functions) 
and one implicit mechanism (the *-operator). Figure 5 
summarises these mechanisms. 


The spawn() function creates a new task to run the 
given Skywriting function. To do this, the Skywriting 
runtime first creates a data object that contains the new 
task’s environment, including the text of the function to 
be executed and the values of any arguments passed to 
the function. This object is called a Skywriting continu- 
ation, because it encapsulates the state of a computation. 
The runtime then creates a task descriptor for the new 
task, which includes a dependency on the new continu- 
ation. Finally, it assigns a reference for the task result, 
which it returns to the calling script. Figure 5(a) shows 
the structure of the created task. 


The spawn_exec() function is a lower-level task- 
creation mechanism that allows the caller to invoke code 
written in a different language. Typically, this function is 
not called directly, but rather through a wrapper for the 
relevant executor (e.g. the built-in java() library
function). When spawn_exec() is called, the runtime
serialises the arguments into a data object and creates a task
that depends on that object (Figure 5(b)). If the argu- 
ments to spawn_exec() include references, the runtime 
adds those references to the new task’s dependencies, to 
ensure that CIEL will not schedule the task until all of 
its arguments are available. Again, the runtime creates 
references for the task outputs, and returns them to the 
calling script. We discuss how names are chosen in §5.1. 


If the task attempts to dereference an object that has 
not yet been created (for example, the result of a call
to spawn()), the current task must block. However,
CIEL tasks are non-blocking: all synchronisation (and
data-flow) must be made explicit in the dynamic task
graph (§3.1). To resolve this contradiction, the runtime
implicitly creates a continuation task that depends on
the dereferenced object and the current continuation (i.e.
the current Skywriting execution stack). The new task
therefore will only run when the dereferenced object has
been produced, which provides the necessary
synchronisation. Figure 5(c) shows the dependency graph that
results when a task dereferences the result of spawn().

A task terminates when it reaches a return statement
(or it blocks on a future reference). A Skywriting task has
a single output, which is the value of the expression in the
return statement. On termination, the runtime stores
the output in the local object store, publishes a concrete
reference to the object, and sends a list of spawned tasks
to the master, in order of creation.

Skywriting ensures that the dynamic task graph
remains acyclic. A task’s dependencies are fixed when
the task-creation function is evaluated, which means 
that they can only include references that are stored in 
the local Skywriting scope before evaluating the func- 
tion. Therefore, a task cannot depend on itself or any of 
its descendants. Note that the results of spawn() and
spawn_exec() are first-class futures [24]: a Skywriting
task can pass the references in its return value or in a
subsequent call to the task-creation functions. This enables a
script to create arbitrary acyclic dependency graphs, such
as the MapReduce dependency graph (§4.3).


4.2 Data-dependent control flow 


Skywriting is designed to coordinate data-centric com- 
putations, which means that the objects in the computa- 
tion can be divided into two spaces: 


Data space. Contains large data objects that may be up 
to several gigabytes in size. 


Coordination space. Contains small objects—such as 
integers, booleans, strings, lists and dictionaries— 
that determine the control flow. 


In general, objects in the data space are processed by pro- 
grams written in compiled languages, to achieve better 
I/O or computational performance than Skywriting can 
provide. In existing distributed execution engines (such 
as MapReduce and Dryad), the data space and coordi- 
nation space are disjoint, which prevents these systems 
from supporting data-dependent control flow. 

To support data-dependent control flow, data must be 
able to pass from the data space into the coordination 
space, so that it can help to determine the control flow. 
In Skywriting, the *-operator transforms a reference to 
a (data space) object into a (coordination space) value. 
The producing task, which may be run by any executor, 
must write the referenced object in a format that Sky- 
writing can recognise; we use JavaScript Object Notation 
(JSON) for this purpose [4]. This serialisation format is 
only used for references that are passed to Skywriting, 
and the majority of executors use the appropriate binary 
format for their data. 
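
The dereference step can be illustrated with a minimal Python sketch. The names `Reference`, `object_store`, `publish` and `deref` are hypothetical and introduced only for illustration; they are not CIEL's API. The only detail taken from the text is that the referenced object is serialised as JSON so that the coordination space can read it.

```python
import json

# Hypothetical stand-in for a worker's object store: object name -> raw bytes.
object_store = {}


class Reference:
    """An opaque handle to a data-space object (illustrative only)."""

    def __init__(self, name):
        self.name = name


def publish(name, value):
    """Producer side: write a small value as JSON bytes and return a reference."""
    object_store[name] = json.dumps(value).encode("utf-8")
    return Reference(name)


def deref(ref):
    """Consumer side: turn a data-space reference into a coordination-space value."""
    return json.loads(object_store[ref.name].decode("utf-8"))


# A task publishes a convergence result; the script then branches on its value.
converged_ref = publish("kmeans:iter3:converged", {"converged": False, "delta": 0.25})
result = deref(converged_ref)
next_step = "stop" if result["converged"] else "spawn another iteration"
```

Only the small object that steers control flow needs to be JSON; bulk data can stay in whatever binary format the executors prefer.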


4.3 Other languages and frameworks 


Systems like MapReduce have become popular, at least 
in part, because of their simple interface: a developer can 
specify a whole distributed computation with just a pair 
of map() and reduce() functions. To demonstrate that 
Skywriting approaches this level of simplicity, Figure 6 
shows an implementation of the MapReduce execution 
model, taken from the Skywriting standard library. 

function apply(f, list) { 
    outputs = []; 
    for (i in range(len(list))) { 
        outputs[i] = f(list[i]); 
    } 
    return outputs; 
} 

function shuffle(inputs, num_outputs) { 
    outputs = []; 
    for (i in range(num_outputs)) { 
        outputs[i] = []; 
        for (j in range(len(inputs))) { 
            outputs[i][j] = inputs[j][i]; 
        } 
    } 
    return outputs; 
} 

function mapreduce(inputs, mapper, reducer, r) { 
    map_outputs = apply(mapper, inputs); 
    reduce_inputs = shuffle(map_outputs, r); 
    reduce_outputs = apply(reducer, reduce_inputs); 
    return reduce_outputs; 
} 


Figure 6: Implementation of the MapReduce programming 
model in Skywriting. The user provides a list of inputs, 
a mapper function, a reducer function and the number 
of reducers to use. 


The mapreduce() function first applies mapper to 
each element of inputs. mapper is a Skywriting func- 
tion that returns a list of r elements. The map outputs 
are then shuffled, so that the i-th output of each map becomes 
an input to the i-th reduce. Finally, the reducer 
function is applied r times to the collected reduce inputs. 
In typical use, the inputs to mapreduce() are data 
objects containing the input splits, and the mapper and 
reducer functions invoke spawn_exec() to perform 
computation in another language. 
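
To make the data flow concrete, the following Python sketch transliterates the three functions of Figure 6 and runs them on a toy word count. The `mapper` and `reducer` shown here, and the choice to partition words by length, are assumptions made for illustration; in CIEL these functions would normally spawn tasks rather than compute locally.

```python
def apply(f, items):
    """Apply f to each element, as in Figure 6."""
    return [f(item) for item in items]


def shuffle(inputs, num_outputs):
    """Transpose: the i-th output of every map becomes an input of reduce i."""
    return [[inputs[j][i] for j in range(len(inputs))] for i in range(num_outputs)]


def mapreduce(inputs, mapper, reducer, r):
    map_outputs = apply(mapper, inputs)
    reduce_inputs = shuffle(map_outputs, r)
    return apply(reducer, reduce_inputs)


R = 2  # number of reducers (illustrative)


def mapper(split):
    """Count words in one split and partition the counts into R buckets."""
    parts = [dict() for _ in range(R)]
    for word in split.split():
        bucket = parts[len(word) % R]  # hypothetical partitioning rule
        bucket[word] = bucket.get(word, 0) + 1
    return parts


def reducer(partials):
    """Merge the partial counts that were routed to this reducer."""
    merged = {}
    for partial in partials:
        for word, count in partial.items():
            merged[word] = merged.get(word, 0) + count
    return merged


counts = mapreduce(["a bb a", "bb ccc a"], mapper, reducer, R)
```

Each mapper returns exactly `R` partitions, which is what allows `shuffle` to be a plain transpose.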

Note that the mapper function is responsible for par- 
titioning data amongst the reducers, and the reducer 
function must merge the inputs that it receives. The im- 
plementation of mapper may also incorporate a com- 
biner, if desired [18]. To simplify development, we have 
ported portions of the Hadoop MapReduce framework to 
run as CIEL tasks, and provide helper functions for par- 
titioning, merging, and processing Hadoop file formats. 

Any higher-level language that is compiled into a DAG 
of tasks can also be compiled into a Skywriting pro- 
gram, and executed on a CIEL cluster. For example, 
one could develop Skywriting back-ends for Pig [32] 
and DryadLINQ [39], raising the possibility of extending 
those languages with support for unbounded iteration. 


5 Implementation issues 


The current implementation of CIEL and Skywriting 
contains approximately 9,500 lines of Python code, and 
a few hundred lines of C, Java and other languages in the 


executor bindings. All of the source code, along with a 
suite of example Skywriting programs (including those 
used to evaluate the system in §6), is available to download 
from our project website: 
http://www.cl.cam.ac.uk/netos/ciel/ 

The remainder of this section describes three interesting 
features of our implementation: memoisation (§5.1), 
master fault tolerance (§5.2) and streaming (§5.3). 


5.1 Deterministic naming & memoisation 


Recall that all objects in a CIEL cluster have a unique 
name. In this subsection, we show how an appropriate 
choice of names can enable memoisation. 

Our original implementation of CIEL used globally- 
unique identifiers (UUIDs) to identify all data objects. 
While this was a conceptually simple scheme, it compli- 
cated fault tolerance (see following subsection), because 
the master had to record the generated UUIDs to support 
deterministic task replay after a failure. 

This motivated us to reconsider the choice of names. 
To support fault-tolerance, existing systems assume that 
individual tasks are deterministic [18, 26], and CIEL 
makes the same assumption (§3.1). It follows that two 
tasks with the same dependencies—including the exe- 
cutable code as a dependency—will have identical be- 
haviour. Therefore the n outputs of a task created with 
the following Skywriting statement 


result = spawn_exec(executor, args, n); 


will be completely determined by executor, args, n 
and their indices. We could therefore construct a name 
for the i-th output by concatenating executor, args, 
n and i, with appropriate delimiters. However, since 
args may itself contain references, names could grow 
to an unmanageable length. We therefore use a collision-resistant 
hash function, H, to compute a digest of args 
and n, which gives the resulting name: 


executor : H(args || n) : i 


We currently use the 160-bit SHA-1 hash function to 
generate the digest. 

Recall the lazy evaluation algorithm from §3.2: tasks 
are only executed when their expected outputs are needed 
to resolve a dependency for a blocked task. If a new 
task’s outputs have already been produced by a previous 
task, the new task need not be executed at all. Hence, 
as a result of deterministic naming, CIEL memoises task 
results, which can improve the performance of jobs that 
perform repetitive tasks. 
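
A minimal Python sketch of deterministic naming and the memoisation that follows from it is shown below. It makes several assumptions that are not in the text: `args` is serialised with canonical JSON before hashing, a colon is used as the delimiter, the `object_table` is a plain dictionary, and outputs are computed eagerly rather than lazily. The executor name `wc` and the `run` callback are hypothetical.

```python
import hashlib
import json


def output_names(executor, args, n):
    """Name each of the n outputs as executor : SHA-1(args, n) : index."""
    # Assumption: canonical JSON is used to serialise args and n before hashing.
    payload = json.dumps({"args": args, "n": n}, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()  # 160-bit digest, 40 hex characters
    return [f"{executor}:{digest}:{i}" for i in range(n)]


object_table = {}  # output name -> value (stand-in for the real object table)
executions = []    # records which tasks actually ran


def spawn_exec(executor, args, n, run):
    """Run the task only if some expected output is not already present."""
    names = output_names(executor, args, n)
    if not all(name in object_table for name in names):
        executions.append((executor, n))
        for name, value in zip(names, run(args)):
            object_table[name] = value
    return names


def text_stats(args):
    """Hypothetical two-output task: character count and word count."""
    return [len(args["text"]), len(args["text"].split())]


first = spawn_exec("wc", {"text": "to be or not"}, 2, text_stats)
second = spawn_exec("wc", {"text": "to be or not"}, 2, text_stats)  # memoised
```

Because the names depend only on the task's inputs, the second call finds both outputs in the table and does not execute the task again.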

The goals of our memoisation scheme are similar to 
the recent Nectar system [23]. Nectar performs static 
analysis on DryadLINQ queries to identify subqueries 
that have previously been computed on the same data. 
Nectar is implemented at the DryadLINQ level, which 
enables it to make assumptions about the semantics of 
each task, and the cost/benefit ratio of caching intermediate 
results. For example, Nectar can re-use the results 
of commutative and associative aggregations from 
a previous query, if the previous query operated on a pre- 
fix of the current query’s input. The expressiveness of 
CIEL jobs makes it more challenging to run these analy- 
ses, and we are investigating how simple annotations in a 
Skywriting program could provide similar functionality 
in our system. 


5.2 Fault tolerance 


A distributed execution engine must continue to make 
progress in the face of network and computer faults. As 
jobs become longer—and, since CIEL allows unbounded 
iteration, they may become extremely long—the proba- 
bility of experiencing a fault increases. Therefore, CIEL 
must tolerate the failure of any machine involved in the 
computation: the client, workers and master. 

Client fault tolerance is trivial, since CIEL natively 
supports iterative jobs and manages job execution from 
start to finish. The client’s only role is to submit the 
job: if the client subsequently fails, the job will con- 
tinue without interruption. By contrast, in order to exe- 
cute an iterative job using a non-iterative framework, the 
client must run a driver program that performs all data- 
dependent control flow (such as convergence testing). 
Since the driver program executes outside the frame- 
work, it does not benefit from transparent fault tolerance, 
and the developer must provide this manually, for exam- 
ple by checkpointing the execution state. In our system, a 
Skywriting script replaces the driver program, and CIEL 
executes the whole script reliably. 

Worker fault tolerance in CIEL is similar to 
Dryad [26]. The master receives periodic heartbeat mes- 
sages from each worker, and considers a worker to have 
failed if (i) it has not sent a heartbeat after a specified 
timeout, and (ii) it does not respond to a reverse message 
from the master. At this point, if the worker has been 
assigned a task, that task is deemed to have failed. 
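
The two-condition failure test can be sketched in Python as follows. The worker records, the `probe` callbacks and the timeout value are hypothetical; the sketch only illustrates that a worker is declared failed when its heartbeat is overdue and it also fails to answer the master's reverse message.

```python
def worker_failed(last_heartbeat, now, timeout, responds_to_probe):
    """A worker has failed only if both conditions from the text hold."""
    heartbeat_overdue = (now - last_heartbeat) > timeout
    if not heartbeat_overdue:
        return False
    # Condition (ii): the master sends a reverse message and waits for a reply.
    return not responds_to_probe()


def failed_tasks(workers, now, timeout):
    """Return the tasks that were assigned to workers now deemed failed."""
    failed = []
    for name in sorted(workers):
        worker = workers[name]
        dead = worker_failed(worker["last_heartbeat"], now, timeout, worker["probe"])
        if dead and worker["task"] is not None:
            failed.append(worker["task"])
    return failed


workers = {
    "w1": {"last_heartbeat": 100.0, "probe": lambda: True, "task": "t1"},
    "w2": {"last_heartbeat": 40.0, "probe": lambda: True, "task": "t2"},
    "w3": {"last_heartbeat": 40.0, "probe": lambda: False, "task": "t3"},
}
tasks_to_rerun = failed_tasks(workers, now=105.0, timeout=30.0)
```

Worker `w2` has missed its heartbeat but still answers the probe, so only the task on `w3` is marked as failed.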

When a task fails, CIEL automatically re-executes it. 
However, if it has failed because its inputs were stored 
on a failed worker, the task is no longer runnable. In 
that case, CIEL recursively re-executes predecessor tasks 
until all of the failed task’s dependencies are resolved. 
To achieve this, the master invalidates the locations in 
the object table for each missing input, and lazily re- 
evaluates the missing inputs. Other tasks that depend on 
data from the failed worker will also fail, and these are 
similarly re-executed by the master. 
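
The recursive recovery described above can be sketched in Python. The data structures are assumptions made for illustration: `object_table` maps an object name to the set of workers holding a copy, and `producers` maps an object name to the task that creates it. The worker name `w_new` is hypothetical.

```python
def invalidate(object_table, failed_worker):
    """Remove the failed worker from the location set of every object."""
    for locations in object_table.values():
        locations.discard(failed_worker)


def ensure(obj, object_table, producers, rerun_log, worker="w_new"):
    """Lazily re-evaluate obj, re-executing predecessors whose outputs are lost."""
    if object_table.get(obj):
        return  # at least one replica survives, nothing to do
    task = producers[obj]
    for dep in task["inputs"]:
        ensure(dep, object_table, producers, rerun_log, worker)
    rerun_log.append(task["name"])  # stands in for actually running the task
    object_table[obj] = {worker}


# A chain in -> A -> a -> B -> b -> C -> c, with a and b stored only on w1.
producers = {
    "a": {"name": "A", "inputs": ["in"]},
    "b": {"name": "B", "inputs": ["a"]},
    "c": {"name": "C", "inputs": ["b"]},
}
object_table = {"in": {"w0"}, "a": {"w1"}, "b": {"w1"}}

rerun_log = []
invalidate(object_table, "w1")
ensure("c", object_table, producers, rerun_log)
```

After `w1` fails, resolving `c` forces `A` and `B` to run again before `C`, while the surviving input `in` is reused as it is.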

Master fault tolerance is also supported in CIEL. In 
MapReduce and Dryad, a job fails completely if its mas- 
ter process fails [18, 26]; in Hadoop, all jobs fail if the 
JobTracker fails [2]; and master failure will usually cause 
driver programs that submit multiple jobs to fail. How- 
ever, in CIEL, all master state can be derived from the 
set of active jobs. At a minimum, persistently storing the 
root task of each active job allows a new master to be 
created and resume execution immediately. CIEL pro- 
vides three complementary mechanisms that extend mas- 
ter fault tolerance: persistent logging, secondary masters 
and object table reconstruction. 

When a new job is created, the master creates a log 
file for the job, and synchronously writes its root task 
descriptor to the log. By default, it writes the log to a log 
directory on local secondary storage, but it can also write 
to a networked file system or distributed storage service. 
As new tasks are created, their descriptors are appended 
asynchronously to the log file, and periodically flushed to 
disk. When the job completes, a concrete reference to its 
result is written to the log directory. Upon restarting, the 
master scans its log directory for jobs without a matching 
result. For those jobs, it replays the log, rebuilding the 
dynamic task graph, and ignoring the final record if it is 
truncated. Once all logs have been processed, the master 
restarts the jobs by lazily evaluating their outputs. 
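
Log replay with a possibly truncated tail can be sketched in Python. The on-disk format is an assumption: the text does not specify it, so this sketch uses one JSON task descriptor per line with hypothetical `task` and `deps` fields.

```python
import json


def replay(log_text):
    """Rebuild the task graph from a job log, ignoring a truncated final record."""
    tasks = {}
    lines = log_text.split("\n")
    for index, line in enumerate(lines):
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                break  # torn write at the tail of the log: safe to drop
            raise      # corruption in the middle of the log is a real error
        tasks[record["task"]] = record["deps"]
    return tasks


# The master crashed while appending the third descriptor.
log_text = (
    '{"task": "root", "deps": []}\n'
    '{"task": "t1", "deps": ["root"]}\n'
    '{"task": "t2", "de'
)
graph = replay(log_text)
```

Dropping the torn record is acceptable under this scheme because replaying the tasks that did survive in the log will spawn the missing task again.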

Alternatively, the master may log state updates to a 
secondary master. After the secondary master registers 
with the primary master, the primary asynchronously for- 
wards all task table and object table updates to the sec- 
ondary. Each new job is sent synchronously, to ensure 
that it is logged at the secondary before the client re- 
ceives an acknowledgement. In addition, the secondary 
records the address of every worker that registers with the 
primary, so that it can contact the workers in a fail-over 
scenario. The secondary periodically sends a heartbeat to 
the primary; when it detects that the primary has failed, 
the secondary instructs all workers to re-register with it. 
We evaluate this scenario in §6.5. 

If the master fails and subsequently restarts, the work- 
ers can help to reconstruct the object table using the con- 
tents of their local object stores. A worker deems the 
master to have failed if it does not respond to requests. At 
this point, the worker switches into reregister mode, and 
the heartbeat messages are replaced with periodic regis- 
tration requests to the same network location. When the 
worker finally contacts a new master, the master pulls a 
list of the worker’s data objects, using a protocol based 
on GFS master recovery [22]. 


5.3 Streaming 


Our earlier definition of a task (§3.1) stated that a task 
produces data objects as part of its result. This definition 
implies that object production is atomic: an object either 
exists completely or not at all. However, since data ob- 
jects may be very large, there is often the opportunity to 
stream the partially-written object between tasks, which 
can lead to pipelined parallelism. 

If the producing task has streamable outputs, it sends a 
pre-publish message to the master, containing stream ref- 
erences for each streamable output. These references are 
used to update the object table, and may unblock other 
tasks: the stream consumers. A stream consumer ex- 
ecutes as before, but the executed code reads its input 
from a named pipe rather than a local file. A separate 
thread in the consuming worker process fetches chunks 
of input from the producing worker, and writes them into 
the pipe. When the producer terminates successfully, it 
commits its outputs, which signals to the consumer that 
no more data remains to be read. 
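
The consumer side of this protocol can be sketched in Python. This is a simplification under stated assumptions: an anonymous pipe from `os.pipe()` stands in for the named pipe, an in-memory list of chunks stands in for the remote producer, and closing the write end models the producer committing its output.

```python
import os
import threading


def fetcher(chunks, write_fd):
    """Stands in for the thread that fetches chunks from the producing worker."""
    with os.fdopen(write_fd, "wb") as pipe_in:
        for chunk in chunks:
            pipe_in.write(chunk)
    # Leaving the with-block closes the write end, which models the commit:
    # the reader sees end-of-file and knows that no more data remains.


def consume(read_fd):
    """The executed code reads its input from the pipe as if it were a file."""
    total = 0
    with os.fdopen(read_fd, "rb") as pipe_out:
        for line in pipe_out:
            total += int(line)
    return total


read_fd, write_fd = os.pipe()
chunks = [b"1\n2\n", b"3\n", b"4\n5\n"]  # partially-written object, in arrival order

thread = threading.Thread(target=fetcher, args=(chunks, write_fd))
thread.start()
total = consume(read_fd)
thread.join()
```

The consumer starts processing as soon as the first chunk arrives, which is the source of the pipelined parallelism.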

In the present implementation, the stream producer 
also writes its output data to a local disk, so that, if 
the stream consumer fails, the producer is unaffected. If 
the producer fails while it has a consumer, the producer 
rolls back any partially-written output. In this case, the 
consumer will fail due to missing input, and trigger re-execution 
of the producer (§5.2). We are investigating 
more sophisticated fault-tolerance and scheduling poli- 
cies that would allow the producer and consumer to com- 
municate via direct TCP streams, as in Dryad [26] and 
the Hadoop Online Prototype [16]. However, as we show 
in the following section, support for streaming yields 
useful performance benefits for some applications. 


6 Evaluation 


Our main goal in developing CIEL was to develop a sys- 
tem that supports a more powerful model of computa- 
tion than existing distributed execution engines, without 
incurring a high cost in terms of performance. In this 
section, we evaluate the performance of CIEL running a 
variety of applications implemented in Skywriting. We 
investigate the following questions: 


1. How does CIEL’s performance compare to a system 
in production use (viz. Hadoop)? (§6.1, §6.2) 


2. What benefits does CIEL provide when executing 
an iterative algorithm? (§6.2) 


3. What overheads does CIEL impose on compute-intensive 
tasks? (§6.3, §6.4) 


4. What effect does master failure have on end-to-end 
job performance? (§6.5) 


For our evaluation, we selected a set of algorithms to answer 
these questions, including MapReduce-style, iterative, 
and compute-intensive algorithms. We chose dynamic 
programming algorithms to demonstrate CIEL’s 
ability to execute algorithms with data dependencies that 
do not translate to the MapReduce model. 


Figure 7: Grep execution time on Hadoop and CIEL (§6.1). 

All of the results presented in this section were gathered 
using m1.small virtual machines on the Amazon 
EC2 cloud computing platform. At the time of writing, 
an m1.small instance has 1.7 GB of RAM and 1 virtual 
core (equivalent to a 2007 AMD Opteron or Intel Xeon 
processor) [1]. In all cases, the operating system was 
Ubuntu 10.04, using Linux kernel version 2.6.32 in 32- 
bit mode. Since the virtual machines are single-core, we 
run one CIEL worker per machine, and configure Hadoop 
to use one map slot per TaskTracker. 


6.1 Grep 


Our grep benchmark uses the Grep example application 
from Hadoop to search a 22.1 GB dump of English- 
language Wikipedia for a three-character string. The 
original Grep application performs two MapReduce jobs: 
the first job parses the input data and emits the matching 
strings, and the second sorts the matching strings by fre- 
quency. In Skywriting, we implemented this as a single 
script that uses two invocations of mapreduce() (§4.3). 
Both systems use identical data formats and execute an 
identical computation (regular expression matching). 
Figure 7 shows the absolute execution time for Grep 
as the number of workers increases from 10 to 100. 
Averaged across all runs, CIEL outperforms Hadoop by 
35%. We attribute this to the Hadoop heartbeat protocol, 
which limits TaskTrackers to polling for tasks once 
every 5 seconds, and to the mandatory “setup” 
and “cleanup” phases that run at the start and end of 
each job [38]. As a result, the relative performance of 
CIEL improves as the job becomes shorter: CIEL takes 
29% less time on 10 workers, and 40% less time on 100 
workers. We observed that a no-op Hadoop job (which 
dispatches one map task per worker, and terminates 
immediately) runs for an average of 30 seconds. Since Grep 
involves two jobs, we would not expect Hadoop to complete 
the benchmark in less than 60 seconds. These results 
confirm that Hadoop is not well-suited to short jobs, 
which is a result of its original application (large-scale 
document indexing). However, anecdotal evidence suggests 
that production Hadoop clusters mostly run jobs 
lasting less than 90 seconds [40]. 


Figure 8: Results of the k-means experiment on Hadoop and CIEL with 20 workers (§6.2). 


6.2 k-means 


We ported the Hadoop-based k-means implementation 
from the Apache Mahout scalable machine learning 
toolkit [3] to CIEL. Mahout simulates iterative-algorithm 
support on Hadoop by submitting a series of jobs and 
performing a convergence test outside the cluster; our 
port uses a Skywriting script that performs all iterations 
and convergence testing in a single CIEL job. 

In this experiment, we compare the performance of the 
two versions by running 5 iterations of clustering on 20 
workers. Each task takes 64 MB of input—80,000 dense 
vectors, each containing 100 double-precision values— 
and k = 100 cluster centres. We increase the number of 
tasks from 20 to 100, in multiples of the cluster size. As 
before, both systems use identical data formats and exe- 
cute an identical computational kernel. Figure 8(a) com- 
pares the per-iteration execution time for the two ver- 
sions. For each job size, CIEL is faster than Hadoop, 
and the difference ranges between 113 and 168 seconds. 
To investigate this difference further, we now analyse the 
task execution profile. 

Figure 8(b) shows the cluster utilisation as a function 
of time for the 5 iterations of 100 tasks. From this figure, 
we can compute the average cluster utilisation: i.e. 
the probability that a worker is assigned a task at any 
point during the job execution. Across all job sizes, CIEL 
achieves 89 ± 2% average utilisation, whereas Hadoop 
achieves 84% utilisation for 100 tasks (and only 59% 
utilisation for 20 tasks). The Hadoop utilisation drops to 
70% at several points when there is still runnable work, 
which is visible as troughs or “noise” in the utilisation 
time series. This scheduling delay is due to Hadoop’s 
polling-based implementation of task dispatch. 
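
Average utilisation in the sense defined above can be computed from task start and end times. The Python sketch below assumes that each worker runs at most one task at a time, as in this evaluation, and uses made-up intervals rather than measured data.

```python
def average_utilisation(task_intervals, num_workers, job_start, job_end):
    """Fraction of worker-time during which a worker had a task assigned."""
    busy = sum(end - start for start, end in task_intervals)
    return busy / (num_workers * (job_end - job_start))


# Illustrative numbers only: two workers, a job lasting 10 time units.
# Worker 1 is busy throughout; worker 2 idles between its two tasks.
intervals = [(0.0, 10.0), (0.0, 4.0), (6.0, 10.0)]
utilisation = average_utilisation(intervals, num_workers=2, job_start=0.0, job_end=10.0)
```

Here 18 of the 20 available worker-time units are spent on tasks, so the average utilisation is 0.9; the idle gap on the second worker is the kind of trough that a polling-based dispatcher produces.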

CIEL also achieves higher utilisation in this experi- 
ment because the task duration is less variable. The 
execution time of k-means is dominated by the map 
phase, which computes k Euclidean distances for each 
data point. Figure 8(c) shows the cumulative distribution 
of map task durations, across all k-means experiments. 
The Hadoop distribution is clearly bimodal, with 64% 
of the tasks being “fast” (μ = 130.9, σ = 3.92) and 
36% of the tasks being “slow” (μ = 193.5, σ = 3.71). 
By contrast, all of the CIEL tasks are “fast” (μ = 134.1, 
σ = 5.05). On closer inspection, the slow Hadoop tasks 
are non-data-local: i.e. they read their input from an- 
other HDFS data node. When computing an iterative job 
such as k-means, CIEL can use information about previ- 
ous iterations to improve the performance of subsequent 
iterations. For example, CIEL preferentially schedules 
tasks on workers that consumed the same inputs in pre- 
vious iterations, in order to exploit data that might still 
be stored in the page cache. When a task reads its input 
from a remote worker, CIEL also updates the object table 
to record that another replica of that input now exists. By 
contrast, each iteration on Hadoop is an independent job, 
and Hadoop does not perform cross-job optimisations, so 
the scheduler is less able to exploit data locality. 

In the CIEL version, a Skywriting task performs a convergence 
test and, if necessary, spawns a subsequent iteration 
of k-means. However, compared to the data-intensive 
map phase, its execution time is insignificant: 
in the 100-task experiment, less than 2% of the total job 
execution time is spent running Skywriting tasks. The 
Skywriting execution time is dominated by communication 
with the master, as the script sends a new task descriptor 
to the master for each task in the new iteration. 


Figure 9: Smith-Waterman (§6.3) and BOPM (§6.4) 
are dynamic programming algorithms, with macro-level 


Figure 10: Smith-Waterman cluster utilisation against 
time, for different block granularities. The best performance 
is observed with 30 × 30 blocks. 


6.3 Smith-Waterman 


In this experiment, we evaluate strategies for paral- 
lelising the Smith-Waterman sequence alignment algo- 
rithm [36]. For strings of size m and n, the algorithm 
computes mn elements of a dynamic programming ma- 
trix. However, since each element depends on three 
predecessors, the algorithm is not embarrassingly par- 
allel. We divide the matrix into blocks—where each 
block depends on values from its three neighbours (Fig- 
ure 9(a))—and process one block per task. 

Figure 11: Speedup of BOPM (§6.4) on 47 workers as 
the number of tasks is varied and the resolution is increased. 


We use CIEL to compute the alignment between two 
1 MB strings on 20 workers. Figure 10 shows the cluster 
utilisation as the block granularity is varied: a granularity 
of m × n means that the computation is split 
into mn blocks. For 10 × 10 (the most coarse-grained 
case that we consider), the maximum degree of parallelism 
is 10, because the dependency structure limits the 
maximum achievable parallelism to the length of the 
antidiagonal in the block matrix. Increasing the number 
of blocks to 20 × 20 allows CIEL to achieve full utilisation 
briefly, but performance remains poor because 
the majority of the job duration is spent either ramping 
up to or down from full utilisation. We observe the 
best performance for 30 × 30, which ramps up to full 
utilisation more quickly than coarser-grained configurations, 
and maintains full utilisation for an extended period, 
because there are more runnable tasks than workers. 
Increasing the granularity beyond 30 × 30 leads 
to poorer overall performance, because the overhead of 
task dispatch becomes a significant fraction of task duration. 
Furthermore, the scheduler cannot dispatch tasks 
quickly enough to maintain full utilisation, which appears 
as “noise” in Figure 10. 
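
The limit on parallelism follows directly from the block dependency structure, as the Python sketch below illustrates. It assumes, as stated earlier in this section, that each block depends on its three neighbours above, to the left, and diagonally above-left, so that all blocks on one antidiagonal are mutually independent. The function names are hypothetical.

```python
def wavefronts(rows, cols):
    """Group blocks by antidiagonal; blocks within one group can run in parallel."""
    waves = [[] for _ in range(rows + cols - 1)]
    for i in range(rows):
        for j in range(cols):
            # Block (i, j) needs (i-1, j), (i, j-1) and (i-1, j-1), all of
            # which lie on earlier antidiagonals.
            waves[i + j].append((i, j))
    return waves


def max_parallelism(rows, cols):
    """The longest antidiagonal bounds the number of simultaneously runnable blocks."""
    return max(len(wave) for wave in wavefronts(rows, cols))


coarse = max_parallelism(10, 10)        # at most 10 blocks runnable at once
fine = max_parallelism(30, 30)          # at most 30 blocks runnable at once
critical_path = len(wavefronts(10, 10)) # 19 dependent steps for 100 blocks
```

With 20 workers, a 10 × 10 decomposition can never keep more than half of the cluster busy, whereas 30 × 30 offers more runnable blocks than workers for the middle part of the computation.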


6.4 Binomial options pricing 


We now consider another dynamic programming algo- 
rithm: the binomial options pricing model (BOPM) [17]. 
BOPM computes a binomial tree, which can be repre- 
sented as an upper-triangular matrix, P. The rightmost 
column of P can be computed directly from the input parameters, 
after which element $p_{i,j}$ depends on $p_{i,j+1}$ and 
$p_{i+1,j+1}$, and the result is the value of $p_{1,1}$. We achieve 
parallelism by dividing the matrix into row chunks, creat- 
ing one task per chunk, and streaming the top row of each 
chunk into the next task. Figure 9(b) shows the element- 
and chunk-level data dependencies for this algorithm. 
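
For readers unfamiliar with the model, the recurrence can be sketched serially in Python. The specific formulas are an assumption: the sketch uses the standard Cox-Ross-Rubinstein parameterisation for a European call, which the text does not spell out, and it uses 0-based indices where the text uses 1-based ones. It is not the chunked, streaming implementation evaluated here.

```python
import math


def bopm_call(spot, strike, rate, sigma, expiry, n):
    """European call price from a binomial tree with n time steps (CRR, assumed)."""
    dt = expiry / n
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    q = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability
    discount = math.exp(-rate * dt)

    # Rightmost column: payoffs at expiry, computed directly from the inputs.
    # Entry i corresponds to i down-moves out of n.
    column = [max(spot * up ** (n - i) * down ** i - strike, 0.0) for i in range(n + 1)]

    # Sweep leftwards: p[i][j] depends on p[i][j+1] and p[i+1][j+1].
    for j in range(n, 0, -1):
        column = [
            discount * (q * column[i] + (1.0 - q) * column[i + 1])
            for i in range(j)
        ]
    return column[0]  # the top-left element of the triangular matrix


price = bopm_call(spot=100.0, strike=100.0, rate=0.05, sigma=0.2, expiry=1.0, n=200)
```

Each leftward step shortens the column by one element, so the total work grows quadratically with the number of time steps.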
BOPM is not an embarrassingly parallel algorithm. 
However, we expect CIEL to achieve some speedup, 
since rows of the matrix can be computed in parallel, and 
we can use streaming tasks (§5.3) to obtain pipelined 
parallelism. We can also achieve better speedup by increasing 
the resolution of the calculation: the problem size 
(n) is inversely proportional to the time step (Δt), and 
the serial execution time increases as $O(n^2)$. 


Figure 12: Cluster utilisation for three iterations of an 
iterative algorithm (§6.5). In the lower case, the primary 
master fails over to a secondary at the beginning of the 
second iteration. The total downtime is 7.7 seconds. 

Figure 11 shows the parallel speedup of BOPM on a 
47-worker CIEL cluster. We vary the number of tasks, 
and increase n from $2 \times 10^5$ to $1.6 \times 10^6$. As expected, the 
maximum speedup increases as the problem size grows, 
because the amount of independent work in each task 
grows. For $n = 2 \times 10^5$ the maximum speedup observed 
is 4.9, whereas for $n = 1.6 \times 10^6$ the maximum 
speedup observed is 23.8. After reaching the maxi- 
mum, the speedup decreases as more tasks are added, 
because small tasks suffer proportionately more from 
constant per-task overhead. Due to our streaming implementation, 
the minimum execution time for a stream 
consumer is approximately one second. We plan to re- 
place our simple, polling-based streaming implementa- 
tion with direct TCP sockets, which will decrease the 
per-task overhead and improve the maximum speedup. 


6.5 Fault tolerance 


Finally, we conducted an experiment in which master 
fail-over was induced during an iterative computation. 
Figure 12 contrasts the cluster utilisation in the non- 
failure and master-failure cases, where the master fail- 
over occurs at the beginning of the second iteration. Be- 
tween the failure of the primary master and the resump- 
tion of execution, 7.7 seconds elapse: during this time, 
the secondary master must detect primary failure, con- 
tact all of the workers, and wait until the workers register 
with the secondary. Utilisation during the second itera- 
tion is poorer, because some tasks must be replayed due 
to the failure. The overall job execution time increases 
by 30 seconds, and the original full utilisation is attained 
once more in the third iteration. 


7 Alternative approaches 


CIEL was inspired primarily by the MapReduce and 
Dryad distributed execution engines. However, there 
are several different and complementary approaches to 
large-scale distributed computing. In this section, we 
briefly survey the related work from different fields. 


7.1 High performance computing (HPC) 


The HPC community has long experience in developing 
parallel programs. OpenMP is an API for developing 
parallel programs on shared-memory machines, which 
has recently added support for task parallelism with de- 
pendencies [7]. In this model, a task is a C or Fortran 
function marked with a compiler directive that identifies 
the formal parameters as task inputs and outputs. The 
inputs and outputs are typically large arrays that fit com- 
pletely in shared memory. OpenMP is more suitable than 
CIEL for jobs that share large amounts of data that is fre- 
quently updated on a fine-grained basis. However, the 
parallel efficiency of a shared memory system is limited 
by interconnect contention and/or non-uniform memory 
access, which limits the practical size of an OpenMP job. 
Nevertheless, we could potentially use OpenMP to ex- 
ploit parallelism within an individual multi-core worker. 
Larger HPC programs typically use the Message Pass- 
ing Interface (MPI) for parallel computing on distributed 
memory machines. MPI provides low-level primitives 
for sending and receiving messages, collective commu- 
nication and synchronisation [30]. MPI is optimised 
for low-latency supercomputer interconnects, which of- 
ten have a three-dimensional torus topology [35]. These 
interconnects are optimal for problems that decompose 
spatially and have local interactions with neighbouring 
processors. Since these interconnects are highly reliable, 
MPI does not tolerate intermittent message loss, and so 
checkpointing is usually used for fault tolerance. For ex- 
ample, Piccolo, which uses MPI, must restart an entire 
computation from a checkpoint if an error occurs [34]. 


7.2 Programming languages 


Various programming paradigms have been proposed to 
simplify or fully automate software parallelisation. 
Several projects have added parallel language con- 
structs to existing programming languages. Cilk-NOW 
is a distributed version of Cilk that allows developers 
to spawn a C function on another cluster machine and 
sync on its result [11]. X10 is influenced by Java, and 
provides finish and async blocks that allow devel- 
opers to implement more general synchronisation pat- 
terns [15]. Both implement strict multithreading, which 
restricts synchronisation to between a spawned thread 
and its ancestor [10]. While this does not limit the expressiveness 
of these languages, it necessitates additional 
synchronisation in the implementation of, for example, 
MapReduce, where non-ancestor tasks may synchronise. 

Functional programming languages offer the prospect 
of fully automatic parallelism [8]. NESL contains a parallel 
“apply to each” operator (i.e. a map() function) that 
processes the elements of a sequence in parallel, and the 
implementation allows nested invocation of this opera- 
tor [9]. Glasgow Distributed Haskell contains mecha- 
nisms for remotely evaluating an expression on a par- 
ticular host [33]. Though theoretically appealing, paral- 
lel functional languages have not demonstrated as great 
scalability as MapReduce or Dryad, which sacrifice ex- 
pressivity for efficiency. 


7.3 Declarative programming 


The relational algebra, which comprises a relatively 
small set of operators, can be parallelised in time 
(pipelining) and space (partitioning) [19]. Pig and Hive 
implement the relational algebra using a DAG of MapRe- 
duce jobs on Hadoop [32, 37]; DryadLINQ and SCOPE 
implement it using a Dryad graph [14, 39]. 

The relational algebra is not universal but can be made 
more expressive by adding a least fixed point opera- 
tor [5], and this research culminated in support for re- 
cursive queries in SQL:1999 [20]. Recently, Bu et al. 
showed how some recursive SQL queries may be trans- 
lated to iterative Hadoop jobs [12]. 

Datalog is a declarative query language based on first- 
order logic [13]. Recently, Alvaro et al. developed a ver- 
sion of Hadoop and the Hadoop Distributed File System 
using Overlog (a dialect of Datalog), and demonstrated 
that it was almost as efficient as the equivalent Java code, 
while using far fewer lines of code [6]. We are not 
aware of any project that has used a fully-recursive logic- 
programming language to implement data-intensive pro- 
grams, though the non-recursive Cascalog language, 
which runs on Hadoop, is a step in this direction [29]. 


7.4 Distributed operating systems 


Hindman et al. have developed the Mesos distributed 
operating system to support “diverse cluster computing 
frameworks” on a shared cluster [25]. Mesos performs 
fine-grained scheduling and fair sharing of cluster re- 
sources between the frameworks. It is predicated on the 
idea that no single framework is suitable for all applica- 
tions, and hence the resources must be virtualised to sup- 
port different frameworks at once. By contrast, we have 
designed CIEL with primitives that support any form of 
computation (though not always optimally), and allow 
frameworks to be virtualised at the language level. 


8 Conclusions 


We designed CIEL to provide a superset of the features 
that existing distributed execution engines provide. With 
Skywriting, it is possible to write iterative algorithms 
in an imperative style and execute them with transpar- 
ent fault tolerance and automatic distribution. However, 
CIEL can also execute any MapReduce job or Dryad 
graph, and the support for iteration allows it to perform 
Pregel- and Piccolo-style computations. 

Our next step is to integrate CIEL primitives with ex- 
isting programming languages. At present, only Skywrit- 
ing scripts can create new tasks. This does not limit uni- 
versality, but it requires developers to rewrite their driver 
programs in Skywriting. It can also put pressure on 
the Skywriting runtime, because all scheduling-related 
control-flow decisions must ultimately pass through in- 
terpreted code. The main benefit of Skywriting is that it 
masks the complexity of continuation-passing style be-
hind the dereference operator (§4.2). We now seek a
way to extend this abstraction to mainstream program- 
ming languages. 

CIEL scales across hundreds of commodity machines, 
but other scaling challenges remain. For example, it is 
unclear how best to exploit multiple cores in a single 
machine, and we currently pass this problem to the ex- 
ecutors, which receive full use of an individual machine. 
This gives application developers fine control over how 
their programs execute, at the cost of additional complex- 
ity. However, it limits efficiency if tasks are inherently 
sequential and multiple cores are available. Furthermore, 
the I/O saving from colocating a stream producer and 
its consumers on a single host may outweigh the cost 
of CPU contention. Finding the optimal schedule is a 
hard problem, and we are investigating simple annota- 
tion schemes and heuristics that improve performance in 
the common case. The recent work on cluster operating 
systems and scheduling algorithms [25, 27] offers hope 
that this problem will admit an elegant solution. 

Further information about CIEL and Skywriting, in- 
cluding the source code, a language reference and a tuto- 
rial, is available from the project website: 


http://www.cl.cam.ac.uk/netos/ciel/ 
Abstract


Effective analysis of raw data from networked systems 
requires bridging the semantic gap between the data and 
the user’s high-level understanding of the system. The 
raw data represents facts about the system state and 
analysis involves identifying a set of semantically rel- 
evant behaviors, which represent “interesting” relation- 
ships between these facts. Current analysis tools, such as 
wireshark and splunk, restrict analysis to the low level
of individual facts and provide limited constructs to aid 
users in bridging the semantic gap. Our objective is to 
enable semantic analysis at a level closer to the user’s 
understanding of the system or process. The key to our 
approach is the introduction of a logic-based formulation 
of high-level behavior abstractions as a sequence or a 
group of related facts. This allows treating behavior rep- 
resentations as fundamental analysis primitives, elevat- 
ing analysis to a higher semantic-level of abstraction. In 
this paper, we propose a behavior-based semantic anal- 
ysis framework which provides: (a) a formal language 
for modeling high-level assertions over networked sys- 
tems data as behavior models, (b) an analysis engine for 
extracting instances of user-specified behavior models 
from raw data. Our approach emphasizes reuse, com-
posability and extensibility of abstractions. We demon-
strate the effectiveness of our approach by applying it
to five analysis tasks: modeling a hypothesis on traffic
traces, modeling experiment behavior, modeling a se- 
curity threat, modeling dynamic change and composing 
higher-level models. Finally, we discuss the performance 
of our framework in terms of behavior complexity and 
number of input records. 
1 Introduction


The ability to convert raw data into higher-level in- 
sights and understanding has become a key enabler in 
many fields. We approach one particular aspect of this 
problem, namely the analysis of data within the domain 
of networked and distributed systems. Such systems rou- 
tinely generate a plethora of logs, trace and audit data 
during their operation. Users, such as researchers and 
system administrators, use this raw data to understand 
system behavior, diagnose problems, discover new be- 
haviors, or verify hypotheses. Effective analysis of such 
raw data requires bridging the semantic gap between raw 
data and the user’s high-level understanding of the anal- 
ysis domain. Our experience with analysis tools reveals 
that this problem is ill-addressed. 


A typical approach to data analysis involves the user 
sifting through the data using simple search and correla- 
tion constructs like boolean queries to identify relation- 
ships and infer meaning from data. For example, wire- 
shark [19] can help identify complete or incomplete TCP 
flows from packet traces and splunk [16] can help iden- 
tify spurious logins from a server log. Our study of four 
popular tools, discussed in Section 2.1, reveals that cur- 
rent approaches require cumbersome multi-step analyses 
to infer semantic relationships from data. For example, 
a user analyzing a network packet trace may first have to 
extract individual flows by specifying specific attribute 
values related to each flow, and then somehow manually 
infer relationships like concurrency between the flows. 
This problem is further complicated if the user has to 
reason and analyze over multiple types of data. This sep- 
aration between the raw data and the meaning it carries 
constitutes the semantic gap. 


In this paper, we focus on the problem of express-
ing analysis tasks that are meaningful and useful to the
user. Specifically, given a finite, timestamped list of facts 
about the system under observation, our objective is to 
assist the user in expressing and modeling semantically 
relevant behaviors, which are “interesting” relationships
between these facts or sequences of facts. These relation-
ships encompass notions of ordering, causality, depen-
dence, or concurrency.


Our insight is that higher-level understanding in net- 
worked and distributed systems can be expressed in the 
form of relationships between system states, simple be- 
haviors, and complex behaviors. For example, in most 
situations, a typical web-server operation is better un- 
derstood as a concurrent relationship between multiple 
HTTP sessions to a server rather than the details of the 
protocols and specific values in the packet headers. Thus, 
our data analysis approach introduces a behavior as a 
primitive analysis construct. Behaviors can be extended 
or constrained to create a behavior model, which forms 
an assertion about the overall behavior of the system. A 
behavior model can then be rapidly applied over data to 
validate the assertion. We discuss complete details about 
specifying behavior models in Section 3, and Section 4 
presents the analysis engine for extracting instances of 
user-specified behavior models from raw data. 

The behavior models are abstract entities to capture 
the semantic essence of a particular relationship without 
focusing on unnecessary details or particular parameters 
that may vary between individual facts or behaviors. In- 
corporation of abstract behavior models as explicitly rep- 
resented and manipulated constructs within our frame- 
work provides two key benefits. First, this abstraction 
allows users of our framework to analyze and understand 
the raw data at a semantically relevant level. In Sec- 
tion 3.4, we introduce an example of a behavior model 
to identify pairs of communication events where the des-
tination IP of the second event is the same as the source IP of
the first. Such models can be used to analyze many dif- 
ferent datasets without any modification. Additionally, 
since behavior models are primitive analysis constructs, 
the framework supports extensibility by composing new 
models from behavior models present in the knowledge 
base as demonstrated in Section 5.5. Thus, represent- 
ing analysis expertise explicitly as behavior models for- 
malizes the semantics for data analysis in networked sys- 
tems. 


The second key benefit of our work is the ability to 
foster sharing and reuse of knowledge embedded in ex- 
plicitly represented behavior models. Our first-hand ex- 
perience with existing tools suggests that in most cases 
knowledge inferred from analysis resides either in a 
domain-specific tool or a single expert’s brain. This is 
due to a lack of an explicit representation for captur- 
ing, storing, sharing, and reusing such knowledge in a 
context-independent way. Many current tools are either 
static in nature, handling only a fixed set of analyses 
and record types, or may offer limited extensibility, but 
through some mechanism that involves significant effort. 
For example, wireshark [19] is easily extensible using
plugins, but writing a plugin requires understanding the 
wireshark API and C programming skills. In contrast, 
a well-defined shareable format for representing knowl-
edge about networked systems data offers the prospect
that many different tools can be driven by, and contribute 
to, a single shared knowledge base. 

Beyond the basic challenge, the task of semantic-level 
analysis is difficult for two disparate reasons. First, the 
definition of “interesting” may vary widely in different 
situations, requiring a rich toolbox of techniques for ef- 
fective analysis. We address this problem by restricting 
the definition of “interesting relationships” to expressing 
a particular set of characteristics of networked systems 
as discussed in Section 3.1. Second, in large scale sys- 
tems, efficient and intelligent data analysis is extremely 
resource intensive due to the sheer volume of system 
events and traces. While in Section 6 we report perfor- 
mance results, this paper primarily discusses the funda- 
mental aspects of defining and employing explicit behav- 
ior models as a data analysis tool. Real-time analysis of 
data for applications such as intrusion detection is a fu- 
ture goal as discussed in Section 7. 

The fundamental contribution of this paper is the in- 
troduction of a behavior-based semantic analysis frame- 
work for confirmatory and exploratory analysis of multi- 
variate, multi-type, timestamped data captured from net- 
worked systems. The main elements of the semantic 
framework include (a) a specialized formal language for 
specifying behavior models and (b) an analysis engine 
for extracting instances of user-specified behavior mod- 
els from data. In confirmatory analysis, the user specifies 
a validation criterion, expected system behavior or hypoth-
esis, by writing a specific model or by composing a
high-level model from existing models contained within 
the knowledge base of the framework. In exploratory 
analysis, a user applies existing models from the knowl- 
edge base to explore data for new or unanticipated be- 
haviors. In Section 5 we present five detailed examples 
of how the framework can be applied for these data anal- 
ysis tasks. 


2 Related Work 


In this section, we set the context for our work by first 
studying four popular analysis tools followed by a dis- 
cussion on specification-based approaches for analysis of 
networked systems data. 


2.1 Tool Comparison 


In this section, we study four popular analysis method- 
ologies: wireshark v1.2.7 [19], splunk v4.1 [16], Simple 
Event Correlator (SEC) v2.5.3 [18], Bro v1.5.2 [14], and 
compare them with our behavior-based semantic anal- 
ysis framework (SAF). Both wireshark and splunk are 
mainly interactive analysis tools while Bro and SEC are
real-time monitoring tools. The behavior-based semantic
analysis framework (SAF) falls in the category of inter-
active analysis tools. The tools are compared along seven
dimensions in Table 1: (a) high-level goals, (b) input data
types, (c) analysis specification language, (d) primitive
analysis constructs, (e) semantic analysis constructs, (f)
ability to compose specifications and (g) abstraction, that
is, specifications in terms of relationships between data
attributes.

| Dimension | wireshark | splunk | SEC | Bro | SAF |
|---|---|---|---|---|---|
| System goals | Interactive analysis | Interactive analysis | Real-time event correlation | High-speed, real-time monitoring | Interactive analysis |
| Input data | Network packets | ASCII data from any source | ASCII data from files, stdin, pipes | Network packets | Any type of data (with plugin) |
| Specification language | Boolean logic | Boolean logic | Simple language for specifying rules | Bro scripting language | Formal language based on temporal logic, interval temporal logic and boolean logic |
| Primitive constructs | Boolean predicates | Boolean predicates, unix-like pipelines and commands | Boolean predicates, functions written in Perl | Events (low-level or higher-level) | Behavior (low-level or higher-level) |
| Semantic constructs | None | External commands can encode semantics | Perl functions can encode semantics | Network notions such as connections, IP addrs., ports, and network protocols | Temporal logic and interval temporal logic operators for defining behaviors (Section 3) |
| Composability of specs | None | Queries can be recorded and then composed into other queries | Matching events can trigger creation of new high-level events | Policies can compose lower-level events to generate higher-level events | Behaviors can be composed into higher level behaviors |
| Abstraction | None | None | Limited | Yes | Yes |

Table 1: Comparison of the behavior-based Semantic Analysis Framework (SAF) with four popular data analysis tools.


Each paragraph below introduces an analysis frame- 
work and the reader is directed to Table 1 for details. The 
corresponding features for our framework (SAF) are in-
troduced in Table 1 and explored in later sections. We
have not considered SQL-based approaches on stream- 
ing data for comparison [6], since SAF representations 
are at a higher-level of abstraction than database query 
languages. However, we further discuss how our frame- 
work could benefit by using the above SQL extensions to 
optimize event storage and retrieval in Section 7. 


wireshark [19] is an open-source tool for interactive 
analysis of a large variety of network data from a packet 
capture file. Wireshark’s design can be separated into 
the analysis framework and plugins. The analysis frame- 
work provides the ability to sift through large volumes 
of packets visually and provides a boolean query gram- 
mar for finding “interesting” relationships and statistical 
summaries over typical networking concepts, for exam- 
ple, rate, flows, bytes, and connections. The plugin archi- 
tecture, on the other hand, is responsible for normalizing 
and presenting different types of packet data and protocol 
behavior to the analysis framework in a uniform way. 


splunk [16] is a popular commercial framework for 
unified data analysis of a large variety of data. Splunk’s 
strength comes from its ability to index various types of 
data, allowing the user to sift through logs by combin- 
ing search queries using boolean operations, pipes and 
powerful statistical and aggregation functions. Splunk 
supports time-based, event-based, value-based correla- 
tions and also allows combining queries into higher-level 
queries. Splunk is extensible using apps, which allow en- 
coding knowledge as queries for sharing and wider dis- 
semination. However, it does not provide support for ex- 
plicitly capturing domain expertise with semantic con- 
structs. It does provide the ability to invoke external 
commands, thus providing an indirect way to incorpo- 
rate explicit domain expertise into the analyses. 


Simple Event Correlator (SEC) [18] is an open-
source framework for rule-based event correlation. SEC
reads the analysis specifications from a configuration file 
containing a set of event matching rules and correspond- 
ing actions. SEC processes data from log files, pipes and 
standard streams to trigger the configured actions on a 
match. It supports both time-based and event-based cor- 
relations and also allows specifying abstract rules that 
bind their values at runtime. SEC is more sophisti-
cated than the previous two tools: it supports composing
higher-level events by correlating low-level events, pro-
viding a framework for semantic understanding. Its rule
types pair and pairwithwindow capture some of the se-
mantics of ordering and duration. However, it lacks sup-
port for inferring interval-based temporal relationships
like concurrency and overlap, and the analysis specifica-
tions in its configuration files do not offer an intuitive means to capture
and share domain expertise in a generic way.

Bro [14] is a high-speed intrusion detection system for 
checking security policy violations by passively moni- 
toring network traffic in real-time. Bro’s security poli- 
cies are written in the specialized Bro scripting language 
which is geared towards security analysis. The lan- 
guage supports semantic constructs such as connections, 
IP addresses, ports, and various network protocols along 
with various operators and functions to express different 
forms of network analyses. Bro has the ability to do time- 
based and event-based correlation. However, Bro mainly 
processes network packet data and uses a programming 
language-based analysis approach. 


2.2 Specification-based Approaches 


Specification-based approaches are particularly appeal-
ing in various areas of networked and distributed systems
due to their ability to be abstract, concise, precise, and 
verifiable. In formal verification of distributed and con- 
current systems, a system is specified in logic and then 
formal reasoning is applied on the specification to ver- 
ify desired properties [3, 9]. In declarative networking, a 
specification language, Network Datalog (NDLog) [10], 
allows defining high-level networking specifications for 
rapidly specifying, modeling, implementing, and experi- 
menting with evolving designs for network architectures. 
In testbed-based experimentation, a simple set of user- 
supplied expectations are used to validate expected be- 
havior of an experiment [12]. 

The formal specification approaches have been well 
developed within the intrusion detection community and 
have been successfully applied to network and audit data 
for analysis. In this section we first present a brief 
overview of four such approaches and then compare 
them to SAF. 

Roger et al. [15] leverage the idea that attack signa-
tures are best expressed in simple temporal logic using
temporal connectives to express ordering of events. They
pose the detection problem as a model-checking prob-
lem against event logs. Naldurg et al. [13] propose an-
other temporal-logic based approach for real-time mon-
itoring and detection. Their language EAGLE supports
parameterized recursive equations and allows specifying
signatures with complex temporal event patterns along
with properties involving real-time, statistics and data
values. Kinder et al. [8] extend the logic CTL (Computa-
tion Tree Logic) and introduce CTPL (Computation Tree
Predicate Logic) to describe malicious code as a high-
level specification. Their approach allows writing spec-
ifications that capture malware variants. Ellis et al. [4]
introduce a behavioral detection approach to malware by
focusing on detecting patterns at higher levels of abstrac-
tion. They introduce three high-level behavioral signa-
tures which have the ability to detect classes of worms
without needing any a priori information about the worm be-
havior.

The SAF abstract models are comparable to the ap- 
proaches of [13, 8, 4] in their use of formal logic and 
temporal constructs for specifications. But, in addition 
to providing an extended set of sophisticated intuitive op- 
erators and constructs, the behavior models presented in 
this paper can be generically applied to model various 
scenarios over a variety of data and are easily composed 
into semantically relevant higher-level models. This al- 
lows creating a knowledge base to explicitly capture do- 
main expertise required for analyzing a large variety of 
operations encountered in networked and distributed sys- 
tems as shown in Section 5. The higher-level behav- 
ioral signatures [4] based on the network-theoretic ab- 
stract communication network (ACN) are tightly bound 
to networking constructs like hosts, routers, sensors and 
links making them very restrictive in their ability to ex- 
press general networked systems behaviors. 

The SAF is based on a logic-based specification ap- 
proach rather than a programming language-based spec- 
ification approach like the one followed in Bro. Our 
goal is that the behavior models should be abstract but 
also concise and precise to support well-known knowl- 
edge representation and reasoning approaches. Logic 
is declarative and type-free, imparting formal seman- 
tics, abstract specifications, and efficient processing by 
analysis engines. The logic-based approach also enables 
building a knowledge base of behavior models to explic- 
itly capture domain expertise that can be used to auto- 
matically reason and infer behavior models. However, 
logic-based approaches are less expressive than program- 
ming languages. The expressiveness of our approach 
is based on requirements derived from characteristics of 
networked systems as discussed in Section 3.1. 


3 Behavior Models 


A particular execution of a networked system or process 
can be captured as a sequence of states, where a state 
is a collection of attributes and their values. A behav- 
ior (b) is a sequence of one or more related states. A 
system execution is thus defined as a combination of dif- 
ferent behaviors, and each new execution may generate 
a unique set of behaviors. A behavior model (φ) is a for-
mula that makes an assertion about the overall behavior
of the system. 

For example, consider a simplified IP flow in net- 
working, where a flow is a communication between two 
hosts identified by their IP addresses. For simplicity 
we assume an IP flow to be broken into two states: 
ip_s2d denotes a packet from some source to destina- 
tion host and ip_d2s denotes a packet from a destination 
to source. Then, a valid IP flow behavior, IPFLOW, is
one where ip_s2d and ip_d2s are related by their source
and destination attributes with the additional criterion that
ip_d2s always occurs after ip_s2d. The behavior model
(φ_ipflow) is an assertion that IPFLOW is valid. We dis-
cuss details of this example and extend it further in Sec-
tion 3.4.

In this section, we first discuss the requirements and 
design choices for a language to specify behaviors fol- 
lowed by the formal syntax and semantics of the lan- 
guage. 


3.1 Requirements 


As discussed in Section 1, the key objective of our frame- 
work is to enable semantic-level analysis over data. A 
semantically expressive language for analysis over net- 
worked and distributed systems data must meet the fol- 
lowing requirements: (a) enable analysis over multi- 
type, multi-variate, timestamped data, (b) express a wide 
variety of “interesting” relationships, (c) enable analysis 
over higher-level abstractions, and (d) enable composing 
abstractions into higher-level abstractions. 

The language should express at least the following
“interesting” relationships to capture the core character- 
istics of networked and distributed systems: (a) causal 
relationships between behaviors, for example, a file be- 
ing opened only if a user is authorized; (b) partial or to- 
tal ordering, for example, in-order or out-of-order arrival 
of packets; (c) dynamic changes over time, for example, 
traffic between client and server drops after an attack on 
the server; (d) concurrency of operations, for example, 
simultaneous web client sessions; (e) multiple possible 
behaviors, for example, a polymorphic worm behavior 
may vary on each execution; (f) synchronous or asyn- 
chronous operations, for example, some operations need 
to complete within a specific time whereas others need 
not; (g) value dependencies between operations, for ex-
ample, a TCP flow is valid only if the attribute-values
contained in the individual packets are related to each
other; (h) invariant operations, for example, some opera-
tions may always hold true; and (i) eventual operations,
for example, some operations happen in the course of
time. In addition, we need traditional mechanisms, such 
as boolean operators and loops, for combining these re- 
lationships into complex behaviors and mechanisms for 
basic counting of events and reasoning over the counts. 

We do not claim completeness of the above require- 
ments but we believe that being able to express the above 
classes of primitive relationships and combining them 
to form complex relationships would suffice for a wide 
range of situations, a few of which we demonstrate as 
case studies in Section 5. 


3.2 Design 


The following four design decisions realize the require-
ments listed above. First, our framework provides logic-
based support to formulate behavior abstractions as a se-
quence or group of related events, where events are a uni-
form representation of system facts as discussed later.
This formulation allows treating this behavior represen-
tation as a fundamental analysis primitive, elevating anal-
yses to a higher semantic-level of abstraction.

Second, the language combines operators from Allen’s 
interval-temporal logic [1], Lamport’s Temporal Logic 
of Actions [9] and boolean logic. Temporal logic allows 
expressing the ordering of events in time without explic- 
itly introducing time. Interval-temporal logic allows ex- 
pressing relationships like concurrency, overlap and or- 
dering between behaviors as relationships between their 
time-intervals. Additionally, complex behaviors are eas- 
ily composed from simpler ones using boolean operators. 

Third, the framework enables specifying dependency 
relationships between event attributes while leaving the 
values to be dynamically populated at runtime. Late 
binding enables abstract specifications that enrich the 
knowledge base as they can be directly applied to a wide 
variety of data-sets. This also enables parametrization of 
models during complex model composition as discussed 
in Section 5.5. 

Lastly, the framework introduces the notion of a 
domain-independent event as a uniform representation 
of multi-type, multi-variate, timestamped data. Specif-
ically, an event (e) is a representation of system state
and is given by a 4-tuple $(o, c, t, av)$ where $o$ is the
event-origin (for example, the host IP), $c$ is the event-
type (for example, PKT_TCP or APP_HTTPD), $t$ is the
event timestamp and
$av = \{(a_i, v_i) \mid a_i \in A,\ v_i \in \mathrm{Strings},\ 1 \le i \le D_c\}$
are the attribute-value pairs contained in the event.
$A$ is the set of attribute labels, for example, sip, dip,
etype. $D_c$ is the number of attributes in
an event of type $c$. This normalization of data to events
ensures that the analysis algorithms are independent of 
the input domain. 
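
As a rough illustration of this normalization, the following Python sketch models an event as the 4-tuple (o, c, t, av). The class name `Event`, the helper `normalize_packet`, and the layout of the raw input record are assumptions made for illustration only; they are not the framework's actual plugin interface.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical in-memory form of the 4-tuple (o, c, t, av)."""
    origin: str        # o: event origin, for example the host IP
    etype: str         # c: event type, for example "PKT_TCP"
    timestamp: float   # t: event timestamp, here in seconds
    attrs: Dict[str, str] = field(default_factory=dict)  # av: attribute-value pairs


def normalize_packet(record: dict, origin: str) -> Event:
    """Turn a raw packet record (assumed layout) into a domain-independent event."""
    attrs = {
        "etype": "PKT_IP",
        "sip": str(record["src"]),   # attribute values are kept as strings
        "dip": str(record["dst"]),
    }
    return Event(origin=origin, etype="PKT_IP",
                 timestamp=float(record["ts"]), attrs=attrs)


# A made-up record, as a packet-capture plugin might hand it over.
raw = {"ts": "12.5", "src": "10.1.1.2", "dst": "10.1.1.3"}
event = normalize_packet(raw, origin="10.1.1.2")
```

Because every input type is reduced to the same shape, code that consumes `Event` objects needs no knowledge of where a record came from.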

We believe these design decisions ensure developing 
abstract behavior models as first-order primitives for cap- 
turing, storing, and reusing domain expertise for the anal- 
ysis of networked systems. Next we discuss the syntax 
of such a language. 


3.3 Syntax


The language grammar for defining a behavior model
φ as a formula consists of five key elements as shown
in Figure 1: state propositions S as atomic formulae;
grouping operators ‘(’ and ‘)’ to define sub-formulae;
logical operators and temporal operators for relating
sub-formulae or atomic formulae; the optional behavior
constraints bcon and operator constraints opcon written
within ‘[’ and ‘]’; and the relational operators relop.

    φ     ::= ‘(’ S | φ ‘)’ {bcon}
            | not φ                  (negation)
            | φ and φ                (logical and)
            | φ or φ                 (logical or)
            | φ xor φ                (logical xor)
            | φ ~> (opcon) φ         (leadsto)
            | □ (opcon) φ            (always)
            | φ olap (opcon) φ       (overlaps)
            | φ dur (opcon) φ        (during)
            | φ sw (opcon) φ         (startswith)
            | φ ew (opcon) φ         (endswith)
            | φ eq (opcon) φ         (equals)
    bcon  ::= ‘[’ {tc | cc} ‘]’
    tc    ::= {at | duration | end} relop t{: t}
    cc    ::= {icount | bcount | rate} relop c{: c}
    opcon ::= ‘[’ relop t{: t} ‘]’
    relop ::= {> | < | = | ≥ | ≤ | ≠}
    t     ::= [0-9]+ {s | ms}
    c     ::= [0-9]+

Figure 1: The grammar for specifying a behavior model φ.

A state proposition, S, is an atomic formula for cap-
turing events that satisfy specified relations between at-
tributes and their values. In essence, S captures states of
a system or process and is the basic element of a behav-
ior model. The most trivial behavior model is one with a
single state proposition. Formally, S is represented as a
finite collection of related attribute-value tuples as:


$$S = \{(a_i, r_i, v_i) \mid i \le N,\ a_i \in A,\ v_i \in V,\ r_i \in \{=, >, <, \ge, \le, \ne\}\}$$


A is a set of string labels, such as sip, dip,
etype and V is a set of string constants, such as
10.1.1.2, /bin/sh, along with two special strings: (a)
strings prefixed with ‘$’, as in $$, $s2.dst and (b) strings
with the wild-card character ‘*’, as in /etc/pas*. Con-
sidering our previous example of IPFLOW, the state
propositions ip_s2d and ip_d2s are written as:


    ip_s2d = {etype=PKT_IP, sip=$$, dip=$$}
    ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip}


State proposition ip_s2d contains three attributes 
etype, sip and dip. etype has a constant value 
PKT_IP, while sip and dip attributes use the ‘$’ pre- 
fixed special variables which are dynamically bound at 
runtime. State proposition ip_d2s defines the values of 
its sip and dip attributes as being dependent on val- 
ues of state ip_s2d. Dependent attributes along with dy- 
namic binding of values allows leaving out details like 
the actual IP addresses from the specification. 
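
The effect of dependent attributes and late binding can be sketched in a few lines of Python. The function `match_state` and the dictionary encoding of propositions and events below are hypothetical, and the sketch assumes that `$$` accepts any value while `$state.attr` must equal the value already bound for that state. Wildcards and relations other than equality are left out.

```python
from typing import Dict, Optional

# State propositions from the IPFLOW example, written as attribute -> pattern maps.
IP_S2D = {"etype": "PKT_IP", "sip": "$$", "dip": "$$"}
IP_D2S = {"etype": "PKT_IP", "sip": "$ip_s2d.dip", "dip": "$ip_s2d.sip"}


def match_state(prop: Dict[str, str], event: Dict[str, str],
                bound: Dict[str, Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the event's attribute values if it satisfies prop, else None.

    bound maps the names of already matched states to their attribute values.
    """
    for attr, pattern in prop.items():
        value = event.get(attr)
        if value is None:
            return None                   # event lacks a required attribute
        if pattern == "$$":
            continue                      # assumed: bind to whatever value is present
        if pattern.startswith("$"):
            state, _, ref_attr = pattern[1:].partition(".")
            if bound.get(state, {}).get(ref_attr) != value:
                return None               # dependent attribute does not agree
        elif pattern != value:
            return None                   # constant value mismatch
    return dict(event)


request = {"etype": "PKT_IP", "sip": "10.1.1.2", "dip": "10.1.1.3"}
reply = {"etype": "PKT_IP", "sip": "10.1.1.3", "dip": "10.1.1.2"}
stray = {"etype": "PKT_IP", "sip": "10.1.1.9", "dip": "10.1.1.2"}

bound = {"ip_s2d": match_state(IP_S2D, request, {})}
reply_matches = match_state(IP_D2S, reply, bound) is not None
stray_matches = match_state(IP_D2S, stray, bound) is not None
```

Neither proposition mentions a concrete address, yet the reply is accepted and the stray packet is rejected once `ip_s2d` has been bound to the request.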

The temporal operators allow expressing temporal re-
lationships like ordering and concurrency between one-
or-more behaviors. The linear-time temporal operator ↝
(leadsto), written as ~>, is used to express causal rela-
tionships between behaviors. The interval temporal logic
operators express concurrent relationships between be-
haviors as relationships either: (a) between their start-
times using sw (startswith), (b) between their endtimes
using ew (endswith) or (c) between their durations using
olap (overlap), eq (equals) and dur (during). The □
(always) operator, written as [], allows expressing invari-
ant behaviors. The logical operators not, and, or, xor
are supported for logical operations over behaviors and 
for creating complex behaviors. 
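
One possible reading of these operators, treating each behavior instance as a closed time interval, is sketched below in Python. The `Interval` type and the exact comparisons are assumptions that follow the usual definitions of Allen's interval relations; the framework's own semantics are those given in Table 2 and may differ in detail, for example when operator constraints are attached.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Time span covered by one behavior instance (start <= end)."""
    start: float
    end: float


def leadsto(a: Interval, b: Interval) -> bool:
    return a.end <= b.start            # b begins only after a has ended (assumed)


def sw(a: Interval, b: Interval) -> bool:
    return a.start == b.start          # startswith: same start time


def ew(a: Interval, b: Interval) -> bool:
    return a.end == b.end              # endswith: same end time


def eq(a: Interval, b: Interval) -> bool:
    return a.start == b.start and a.end == b.end


def olap(a: Interval, b: Interval) -> bool:
    return a.start < b.start < a.end < b.end    # a overlaps the beginning of b


def dur(a: Interval, b: Interval) -> bool:
    return b.start < a.start and a.end < b.end  # a lies strictly inside b


session_1 = Interval(0.0, 10.0)
session_2 = Interval(4.0, 15.0)
lookup = Interval(1.0, 3.0)

concurrent = olap(session_1, session_2)   # the two sessions overlap in time
nested = dur(lookup, session_1)           # the lookup happens during session_1
ordered = leadsto(lookup, session_2)      # the lookup finishes before session_2
```

Expressing concurrency as a comparison of interval endpoints is what lets a model state "two sessions overlap" without naming any absolute time.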

Behavior constraints allow placing additional con- 
straints on the matching behavior instances and are spec- 
ified immediately following the behavior within square 
brackets. Constraints and their values are related using 
the standard relational operators. The six behavior con-
straints are divided into time constraints tc and count con-
straints cc. Time constraints allow constraining behav-
ior starttime using at, behavior endtime using end and
behavior duration using duration. The time value, t,
for the constraint can be specified as a single positive 
value or as a range. Additionally, the values can be suf- 
fixed with either ‘s’ or ‘ms’ to indicate seconds or mil- 
liseconds respectively. The count constraints allow con- 
straining number of matching behavior instances using 
icount, the size of each behavior instance using bcount 
and rate of events within a behavior instance using rate. 
Operator constraints allow specifying time bounds over 
the temporal operators thus allowing their semantics to 
be slightly modified. The operator constraint values are 
specified as a single value or a range along with a rela- 
tional operator. Table 2 presents detailed semantics of 
operators along with behavior and operator constraints. 
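
To make the time-value syntax above concrete, the following Python sketch parses a single value or a range into milliseconds. It is only an illustration: the function name parse_time_value, the low:high range separator applied to time values, and the choice to treat an unsuffixed number as seconds are assumptions and are not taken from the framework.

```python
def parse_time_value(text):
    """Parse '5s', '250ms' or a range such as '1s:5s' into (low_ms, high_ms)."""

    def one(token):
        token = token.strip()
        if token.endswith("ms"):
            value = float(token[:-2])
        elif token.endswith("s"):
            value = float(token[:-1]) * 1000.0
        else:
            # Assumption: a bare number is interpreted as seconds.
            value = float(token) * 1000.0
        if value < 0:
            raise ValueError("time values must be positive: %r" % token)
        return value

    parts = text.split(":")
    if len(parts) == 1:
        value = one(parts[0])
        return (value, value)
    if len(parts) == 2:
        low, high = one(parts[0]), one(parts[1])
        if low > high:
            raise ValueError("range lower bound exceeds upper bound")
        return (low, high)
    raise ValueError("expected a single value or a low:high range")
```

A single value is returned as a degenerate range, so a caller can check every time constraint with the same interval comparison.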

Expressing a behavior in the language consists of writing
sub-formulae. Behaviors are always enclosed within parentheses '(' and
')'. Simple behaviors are constructed by relating one or more state
propositions using operators, while complex behaviors are constructed
by relating one or more behaviors. The grammar also allows expressing
complex behaviors using recursion and we present an example in
Section 5.3. Recursive definitions allow expressing looping behavior
for which the loop bounds can be optionally specified using the bcount
behavior constraint. The current grammar does not support existential
and universal quantification since the need for them is not yet clear.
We plan to explore these language extensions as part of our future
work.

Writing behavior models in the framework involves additional syntax
such as namespaces, headers and variables, which are discussed along
with the case studies in Section 5.1 and Section 5.2. The next section
presents the formal semantics of the language.


3.4 Semantics 


We first define two concepts important for understanding the
semantics. A sequential log (\(L\)) is a finite sequence of timestamped
events \(L = e_1, e_2, e_3, \ldots, e_n\) such that
\(e_i.t < e_j.t, \forall i < j\). A behavior instance \(B_\phi\) for a
behavior model \(\phi\) is a sequence or group of events satisfying the
behavior model \(\phi\):

\[ B_\phi = (starttime, endtime, bcount, (b_1, b_2, \ldots, b_k)) \]

where each element of \((b_1, b_2, \ldots, b_k) \subseteq L\) could be
an individual event \(e\) or another behavior instance \(B_{\phi'}\).
\(starttime = b_1.starttime\) is the starting time of the behavior as
defined by its first element and \(endtime = b_k.endtime\) is the ending
time of the behavior as defined by its last element. \(bcount = k\) is
the total number of elements in the behavior instance. All \(b_i\)'s
are in increasing time order of their starttime. Additionally, let
\(B_\phi.duration = (B_\phi.endtime - B_\phi.starttime)\) be the
duration of the behavior instance and
\(\lvert B_\phi\rvert = B_\phi.bcount\) represent the size of the
behavior instance. If \(\phi\) is a simple behavior, such as a state
proposition \(S\), then

\[ B_S = (e_{i_1}.t, e_{i_k}.t, k, (e_{i_1}, \ldots, e_{i_k})) \]

where \((e_{i_1}, \ldots, e_{i_k}) \subseteq L\).

Behavior model \(\phi\) | Meaning of \(\phi\) | \(L\) satisfies \(\phi\) (\(L \models \phi\)) iff
(\(\phi\)) | \(\phi\) is a behavior. | \(\exists B_\phi \subseteq L\) and \(\lvert B_\phi\rvert > 0\)
\(S\) | \(S\) is a state proposition defined as \(S = (a_1\, r_1\, v_1), \ldots, (a_d\, r_d\, v_d)\). | (a) \(\lvert B_S\rvert > 0\), (b) \(\forall e \in B_S, \forall i \in \{1, \ldots, d\}\), \(e.a_i\) is defined and values \(e.v_i\) and \(S.v_i\) satisfy relation \(r_i\).
(not \(\phi\)) | Negation of behavior is true. | \(L \not\models \phi\), that is, \(\lvert B_\phi\rvert = 0\)
(\(\phi_1\) and \(\phi_2\)) | Both \(\phi_1\) and \(\phi_2\) are true. | \(L \models \phi_1\) and \(L \models \phi_2\)
(\(\phi_1\) or \(\phi_2\)) | \(\phi_1\) and \(\phi_2\) are not both false simultaneously. | \(L \models \phi_1\) or \(L \models \phi_2\) or satisfies both \(\phi_1\) and \(\phi_2\)
(\(\phi_1\) xor \(\phi_2\)) | Either \(\phi_1\) or \(\phi_2\) is true but not both. | \(L \models \phi_1\) or \(L \models \phi_2\) but not both
(\(\phi_1\) ~> \(\phi_2\)) | \(\phi_1\) leadsto \(\phi_2\), that is, whenever \(\phi_1\) is satisfied \(\phi_2\) will eventually be satisfied. | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \(B_{\phi_2}.starttime > B_{\phi_1}.endtime\)
(\(\phi_1\) ~>[< \(t\)] \(\phi_2\)) | Whenever \(\phi_1\) is satisfied \(\phi_2\) will be satisfied within \(t\) time units. | (a) \(L \models (\phi_1 \leadsto \phi_2)\), (b) \(B_{\phi_2}.starttime < (B_{\phi_1}.endtime + t)\)
(\(\Box\phi\)) | \(\phi\) is always satisfied, that is, satisfied by each event. | \(\forall e \in L, e \models \phi\)
(\(\Box\)[= \(t\)] \(\phi\)) | \(\phi\) is always satisfied within every consecutive interval (epoch) of \(t\) time units. | \(t > 0\) and for all consecutive intervals \(t\), \(l_t \subseteq L\) and \(l_t \models \phi\)
(\(\phi_1\) sw \(\phi_2\)) | \(\phi_1\) starts with \(\phi_2\). | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \(B_{\phi_1}.starttime = B_{\phi_2}.starttime\)
(\(\phi_1\) sw[> \(t\)] \(\phi_2\)) | \(\phi_1\) starts \(t\) time units after \(\phi_2\). | (a) \(L \models (\phi_1\ \mathrm{sw}\ \phi_2)\), (b) \(B_{\phi_1}.starttime > (B_{\phi_2}.starttime + t)\)
(\(\phi_1\) ew \(\phi_2\)) | \(\phi_1\) ends with \(\phi_2\). | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \(B_{\phi_1}.endtime = B_{\phi_2}.endtime\)
(\(\phi_1\) ew[= \(t\)] \(\phi_2\)) | \(\phi_1\) ends \(t\) time units after \(\phi_2\). | (a) \(L \models (\phi_1\ \mathrm{ew}\ \phi_2)\), (b) \(B_{\phi_1}.endtime = (B_{\phi_2}.endtime + t)\)
(\(\phi_1\) olap \(\phi_2\)) | \(\phi_1\) overlaps \(\phi_2\), that is, \(\phi_1\) starts after \(\phi_2\) starts but before \(\phi_2\) ends and ends after \(\phi_2\) ends. | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \((B_{\phi_2}.starttime < B_{\phi_1}.starttime < B_{\phi_2}.endtime)\) and \((B_{\phi_1}.endtime > B_{\phi_2}.endtime)\)
(\(\phi_1\) olap[> \(t\)] \(\phi_2\)) | \(\phi_1\) overlaps \(\phi_2\) and the overlapping region is greater than \(t\) time units. | (a) \(L \models (\phi_1\ \mathrm{olap}\ \phi_2)\), (b) the overlap \((B_{\phi_2}.endtime - B_{\phi_1}.starttime) > t\)
(\(\phi_1\) eq \(\phi_2\)) | \(\phi_1\) equals \(\phi_2\) in duration. | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \(B_{\phi_1}.duration = B_{\phi_2}.duration\)
(\(\phi_1\) eq[= \(t\)] \(\phi_2\)) | \(\phi_1\) and \(\phi_2\) are both of duration \(t\). | (a) \(L \models (\phi_1\ \mathrm{eq}\ \phi_2)\), (b) \(B_{\phi_1}.duration = B_{\phi_2}.duration = t\)
(\(\phi_1\) dur \(\phi_2\)) | \(\phi_1\) occurs during \(\phi_2\), that is, \(\phi_1\) starts after \(\phi_2\) and ends before \(\phi_2\) ends. | (a) \(L \models \phi_1\) and \(L \models \phi_2\), (b) \(B_{\phi_1}[1] \neq B_{\phi_2}[1]\), (c) \((B_{\phi_1}.starttime > B_{\phi_2}.starttime)\) and \((B_{\phi_1}.endtime < B_{\phi_2}.endtime)\)
(\(\phi_1\) dur[= \(t_1\):\(t_2\)] \(\phi_2\)) | \(\phi_1\) occurs during \(\phi_2\) with duration between \(t_1\) and \(t_2\). | (a) \(L \models (\phi_1\ \mathrm{dur}\ \phi_2)\), (b) \((t_1 < B_{\phi_1}.duration < t_2)\)
(\(\phi\))[icount = \(c\)] | The number of behavior instances satisfying \(\phi\) is \(c\). | (a) \(L \models \phi\), (b) there exist distinct \(B_\phi^1, \ldots, B_\phi^c \subseteq L\)
(\(\phi\))[bcount = \(c\)] | Behavior instances satisfying \(\phi\) are of size \(c\). | (a) \(L \models \phi\), (b) \(B_\phi.bcount = c\)
(\(\phi\))[rate > \(c\)] | Behavior instances satisfying \(\phi\) have a rate, defined as (behavior size / behavior duration), greater than \(c\). | (a) \(L \models \phi\), (b) \((B_\phi.bcount / B_\phi.duration) > c\) and \(B_\phi.duration > 0\)
(\(\phi\))[at < \(t\)] | Starting time of behavior instances satisfying \(\phi\) must be less than absolute time \(t\). | (a) \(L \models \phi\), (b) \(B_\phi.starttime < t\)
(\(\phi\))[end > \(t\)] | Behavior instances satisfying \(\phi\) have endtime greater than absolute time \(t\). | (a) \(L \models \phi\), (b) \(B_\phi.endtime > t\)
(\(\phi\))[duration \(\neq t\)] | Behavior instances satisfying \(\phi\) are of duration \(\neq t\). | (a) \(L \models \phi\), (b) \(B_\phi.duration \neq t\)

Table 2: Semantics of operators, behavior constraints and operator
constraints in our logic. We describe semantics for constraints
considering only a single relational operator and refer the reader to
the framework webpage [17] for details.

Figure 2: Sequence diagram of IP-interaction between four nodes. → or ←
represents an IP packet between a source (s) and destination (d). An IP
flow is a packet pair between s and d.
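
The definition of a behavior instance can be sketched as a small recursive data structure. The Python below is an illustration only: the class names Event and BehaviorInstance, the name field, and the timestamps in the usage lines are hypothetical, and the properties simply restate the definitions of starttime, endtime, bcount and duration given above.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    t: float      # event timestamp
    name: str     # label used only for illustration

    @property
    def starttime(self):
        return self.t

    @property
    def endtime(self):
        return self.t


@dataclass(frozen=True)
class BehaviorInstance:
    # Elements are events or nested behavior instances, ordered by starttime.
    elements: Tuple[Union[Event, "BehaviorInstance"], ...]

    @property
    def starttime(self):
        return self.elements[0].starttime   # defined by the first element

    @property
    def endtime(self):
        return self.elements[-1].endtime    # defined by the last element

    @property
    def bcount(self):
        return len(self.elements)           # counts elements, not nested events

    @property
    def duration(self):
        return self.endtime - self.starttime


flow_a = BehaviorInstance((Event(1.0, "e1"), Event(7.0, "e7")))
flow_b = BehaviorInstance((Event(2.0, "e2"), Event(5.0, "e5")))
pair = BehaviorInstance((flow_a, flow_b))
```

Because bcount counts direct elements only, a pair of flows has size 2 even though it spans four events.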

Given a finite sequential log \(L\) and a user-defined behavior model
\(\phi\), the goal of the analysis is to find all behavior instances
\((B_\phi^1, B_\phi^2, \ldots)\) from \(L\) that satisfy the behavior
model, where satisfiability is defined as follows:

\[ L \models \phi \text{ iff } \exists B_\phi \subseteq L \text{ and } \lvert B_\phi\rvert > 0 \]

That is, the log \(L\) satisfies (\(\models\)) the behavior model
\(\phi\) iff there exists a behavior instance \(B_\phi\) in \(L\) of
finite length \(\lvert B_\phi\rvert\). Since \(\phi\) is a composite
formula created using many sub-formulae, the satisfiability of \(\phi\)
is determined as a function of the satisfiability of its sub-formulae.
Table 2 defines the satisfiability criteria for sub-formulae formed
using the operators and constraints. We next explain the key language
ideas by defining simple models and applying them to a fictitious data
set.

Assume a packet trace of seven IP packets representing an interaction
between four nodes A, B, C and D as shown in Figure 2. Let the
sequential log of corresponding events be \(e_1, e_2, \ldots, e_7\).

Using the states ip_s2d and ip_d2s defined earlier in Section 3.3, IP
flow behavior is written as a causal relationship between the state
propositions ip_s2d and ip_d2s as IPFLOW=(ip_s2d ~> ip_d2s). There are
three IP flow instances in Figure 2 that satisfy IPFLOW, that is,
icount = 3 with bcount = 2 for each instance:

\[ B_{ipflow}^1 = (e_1, e_7) \]
\[ B_{ipflow}^2 = (e_2, e_5) \]
\[ B_{ipflow}^3 = (e_3, e_4) \]


Extending the example, a complex behavior for pairs of overlapping IP
flows can now be written as IPFLOW_PAIRS=(IPFLOW olap IPFLOW). There
are in all three instances of overlapping IPFLOW pairs from Figure 2.
That is,

\[ B_{ipflow\_pairs}^1 = ((e_1, e_7), (e_2, e_5)) \]
\[ B_{ipflow\_pairs}^2 = ((e_1, e_7), (e_3, e_4)) \]
\[ B_{ipflow\_pairs}^3 = ((e_2, e_5), (e_3, e_4)) \]

Again, icount = 3 and for each instance bcount = 2, since bcount counts
the number of IPFLOW occurrences and not individual events.

We can additionally define a bad IP flow behavior BAD_IPFLOW as one for
which there was no matching response from the destination. That is,
BAD_IPFLOW=(ip_s2d ~> (not ip_d2s)). Event \(e_6\) matches the
BAD_IPFLOW model since it has no matching response. That is,
\(B_{bad\_ipflow} = (e_6)\), with bcount = 1.

The next section describes the architecture of the analysis framework.


4 Semantic Analysis Framework 


Given our objective of semantic-level data analysis, we require the
analysis framework to support (a) analysis of multi-type,
multi-variate, timestamped data, (b) defining new models by composing
existing models, and (c) storage, retrieval and extensibility of
domain-specific behavior models. The framework has five components, as
shown in Figure 3: the knowledge base, a data normalizer, an event
storage system, an analysis engine and a presentation engine.
Decoupling the behavior model specification, the input processing and
the analysis algorithms allows the framework to be directly applied
across several different domains. Subsequent sections discuss the
details of each component.


4.1 Knowledge Base 


The knowledge base provides a namespace-based stor- 
age mechanism to store behavior models and is central 
in providing an extensible framework. For example, our 
networking domain currently defines models for ipflow, 
tcpflow, icmpflow and udpflow. These behavior models 
capture common domain information and allow a user 
to rapidly compose higher-level models by reusing exist- 
ing behavior models. Reusing a behavior model from the 
knowledge base constitutes importing it using its names- 
pace and name. For example, referring to the behavior 
model in Figure 4(a), line 5 imports the IPFLOW model 
from the NET.BASE_PROTO domain. The namespace al- 
lows categorization of models into domain-specific areas 
while allowing composition of models across domains. 
We implement namespaces similar to Java namespaces, 
that is, each component in the namespace corresponds to 
a directory name on the filesystem. This simple design 
ensures that the knowledge base is easily customizable 
and extensible. 
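
The namespace-to-directory mapping described above can be sketched in a few lines of Python. The function name model_path, the knowledge-base root kb, the lowercasing of components and the .model file suffix are all assumptions made for illustration; the text only states that each namespace component corresponds to a directory name.

```python
from pathlib import PurePosixPath


def model_path(root, qualified_name, suffix=".model"):
    """Map e.g. NET.BASE_PROTO.IPFLOW to <root>/net/base_proto/ipflow<suffix>."""
    # Assumption: directory and file names are the lowercased components.
    *namespace, name = qualified_name.lower().split(".")
    return PurePosixPath(root, *namespace, name + suffix)


path = model_path("kb", "NET.BASE_PROTO.IPFLOW")
```

Under this layout, adding a model to the knowledge base amounts to placing a file in the directory that matches its namespace.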


4.2 Data Normalizer 


The data normalizer maps a data record to the event format defined in
Section 3.2. Raw data accepted by the normalizer can be in the form of
trace files, packet dumps, audit logs, security logs, syslogs, kernel
logs or script output, with the only requirement that each data record
have a timestamp and a message field. Specialized plugins in the
normalizer convert each type of raw data into corresponding events.
Figure 3(b) shows a possible event format for an IP packet from a
packet dump. The current normalizer supports a C-based plugin API for
writing new specialized plugins. The framework includes plugins for the
basic packet types of IP, TCP, UDP, ICMP and DNS, along with plugins
for parsing syslog, auth and server logs.

[Figure 3 (diagram, not reproduced). Recoverable panel contents:
(a) The IPFLOW behavior model, with header fields name=IPFLOW,
    namespace=net.base_proto, qualifier={etype=PKT*} and import=None,
    followed by its states section. IPFLOW models the higher-level
    semantics of an IP flow as a causal relationship between an IP
    packet pair.
(b) Example PKT_IP event with two sample attributes.
(c) Behavior instances from a packet trace satisfying the IPFLOW model
    (Total Matching Instances: 11), for example:
        PKT_IP | 1261169032 | 10.1.1.1 | 10.1.1.2
        PKT_IP | 1261169032 | 10.1.1.2 | 10.1.1.1]

Figure 3: The semantic analysis framework (SAF) captures a user's
higher-level analysis intent as (a) a behavior model, applies the model
over (b) a finite stream of events normalized from raw data, and
(c) outputs events satisfying the behavior model.


4.3 Event Storage


The event storage component is responsible for storing 
the events from the data normalizer into a database. Ev- 
ery event-type has a separate table, the columns of the 
tables correspond to the event attributes and each row 
describes an event. The current implementation stores 
all events into a SQLite database for two reasons: (a) it 
provides a standard and ready-to-use interface for stor- 
ing and fetching events and (b) its server-less operation 
and open-source nature ensures portability on commod- 
ity systems. Our experience suggests that SQLite per- 
forms reasonably well for a large number of situations 
but presents challenges for complex analysis as the vol- 
ume of events increases. Our future work includes inves- 
tigating the scale and efficiency challenges involved in 
storage and retrieval of events. 
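
The table-per-event-type layout can be illustrated with Python's standard sqlite3 module. This is a minimal sketch under stated assumptions, not the framework's storage code: the helper name store_events and the column names timestamp, sip and dip are hypothetical, and the two sample rows reuse the values shown in Figure 3(c).

```python
import sqlite3


def store_events(conn, etype, events):
    """Store events of one type in a table named after the event type."""
    if not events:
        return
    columns = sorted({key for event in events for key in event})

    def quote(identifier):
        # Identifiers cannot be bound as SQL parameters, so quote them.
        return '"%s"' % identifier.replace('"', '""')

    table = quote(etype)
    column_list = ", ".join(quote(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (table, column_list))
    conn.executemany(
        "INSERT INTO %s (%s) VALUES (%s)" % (table, column_list, placeholders),
        [tuple(event.get(c) for c in columns) for event in events],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_events(conn, "PKT_IP", [
    {"timestamp": 1261169032, "sip": "10.1.1.1", "dip": "10.1.1.2"},
    {"timestamp": 1261169032, "sip": "10.1.1.2", "dip": "10.1.1.1"},
])
rows = conn.execute(
    'SELECT sip, dip FROM "PKT_IP" WHERE sip = ?', ("10.1.1.1",)
).fetchall()
```

Each state proposition can then be answered with a single SELECT over the table for its event type.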


4.4 Analysis and Presentation Engine 


Given a finite sequential log \(L\) and a user-defined behavior model
\(\phi\), the goal of the analysis engine is to find all behavior
instances \((B_\phi^1, B_\phi^2, \ldots)\) from \(L\) that satisfy the
behavior model. Let the events in \(L\) be stored internally in the
event storage database \(E_{db}\). We discuss only the key ideas behind
the analysis process by describing the extraction of behavior instances
satisfying the IPFLOW model defined in Section 3.4 from the sample data
in Figure 2.


The behavior model \(\phi\) is first internally represented in a manner
similar to a compiler expression tree and is then evaluated
left-to-right in a post-order fashion. The satisfiability of the
behavior model is determined as a function of the satisfiability of
each of the component behaviors according to the semantics defined in
Table 2. For the IPFLOW model, the state proposition
ip_s2d={etype=PKT_IP, sip=$$, dip=$$} is evaluated first. Since it does
not have any dependent attributes, its expression is converted to the
query {etype=PKT_IP, sip=*, dip=*}, which is used to fetch all events
in \(E_{db}\) matching the query. All events
\((e_1, e_2, e_3, e_4, e_5, e_6, e_7)\) match the state ip_s2d.


Next, the proposition ip_d2s={etype=PKT_IP, sip=$ip_s2d.dip,
dip=$ip_s2d.sip} is evaluated. Its attributes depend on the attributes
of state ip_s2d. So, using each event that matched ip_s2d, a
corresponding query is generated by resolving the values of sip and dip
using the values from the matched events. From Figure 2, \(e_1\)
matches \(e_7\), \(e_2\) matches \(e_5\) and \(e_3\) matches \(e_4\).
\(e_5\) and \(e_6\) are also possible candidates, but since \(e_5\)
already matched \(e_2\), it is not paired with \(e_6\). Finally, the
operator ~> is evaluated, where the satisfiability criteria described
in Table 2 are applied and any specified operator constraints are
checked. The three instances satisfying the criteria, \((e_1, e_7)\),
\((e_2, e_5)\) and \((e_3, e_4)\), are returned.

The presentation engine is responsible for extracting the output from
the analysis stage and presenting it in a summarized format. We
currently support printing the output in a tabular format as shown in
Figure 3(c). We next present a brief analysis of the algorithm.
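
The matching of ip_s2d and ip_d2s events described above can be sketched as a greedy in-memory pairing. This Python is a simplified stand-in for the engine, which issues database queries instead. The function name match_ipflows is hypothetical, and because Figure 2 is not reproduced here, the addresses assigned to e1 through e7 are assumed values chosen only to be consistent with the matches stated in the text.

```python
def match_ipflows(events):
    """Greedy stand-in for evaluating (ip_s2d ~> ip_d2s).

    events: time-ordered list of (name, timestamp, sip, dip) tuples.
    Returns (flows, unmatched_names).
    """
    used = set()   # indices of events already placed in a flow
    flows = []
    for i, (name, t, sip, dip) in enumerate(events):
        if i in used:
            continue
        for j in range(i + 1, len(events)):
            rname, rt, rsip, rdip = events[j]
            # ip_d2s: sip=$ip_s2d.dip, dip=$ip_s2d.sip, occurring later.
            if j not in used and rt > t and rsip == dip and rdip == sip:
                used.update((i, j))
                flows.append((name, rname))
                break
    unmatched = [events[k][0] for k in range(len(events)) if k not in used]
    return flows, unmatched


# Hypothetical addresses for the seven packets of Figure 2.
LOG = [
    ("e1", 1, "A", "B"), ("e2", 2, "A", "C"), ("e3", 3, "A", "D"),
    ("e4", 4, "D", "A"), ("e5", 5, "C", "A"), ("e6", 6, "A", "C"),
    ("e7", 7, "B", "A"),
]
flows, unmatched = match_ipflows(LOG)
```

With these assumed addresses the sketch pairs e1 with e7, e2 with e5 and e3 with e4, and leaves e6 unmatched, which corresponds to the BAD_IPFLOW instance of Section 3.4.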

Algorithm Analysis. As described in Section 3.3, state propositions
could contain constant attribute values (cStates), such as 10.1.1.2;
dependent values (dStates), such as $s1.dip; or dynamic values
(iStates), such as $$. A simple behavior consists of a combination of
these states using one or more combinations of operators and
constraints. We assume a constant processing time for all operators and
constraints. Then, given an input of \(N\) events, processing a state
proposition can involve two important operations which influence the
run time: (i) querying using the state expression and (ii) processing
the results of the query, if any. In the case of cStates and iStates,
exactly one query is made, and it generates at most \(N\) responses.
Thus, the worst case for processing those \(N\) responses is \(O(N)\).
In the case of a dState, given \(N\) events, there are \(N\) queries to
be made and in the worst case every query may return \(O(N)\) results
that have to be processed. Thus, processing dependent states involves a
worst case of \(O(N^2)\) operations. We present our performance results
in Section 6.
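
The two worst-case bounds can be written side by side. The symbol \(T\) is our own shorthand for the number of result-processing steps and does not appear in the analysis above; the derivation assumes, as stated, constant time per operator and constraint.

```latex
\begin{align*}
T_{\text{cState}}(N) = T_{\text{iState}}(N)
  &= \underbrace{1}_{\text{query}} \cdot O(N) = O(N) \\
T_{\text{dState}}(N)
  &= \underbrace{N}_{\text{queries}} \cdot O(N) = O(N^2)
\end{align*}
```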


5 Case Studies 


In this section, we evaluate the utility of our semantic 
framework by applying it to five different analysis sce- 
narios: (a) confirming a hypothesis on collected net- 
work traces, (b) specifying expected system behavior 
during network experimentation, (c) modeling worm be- 
havior as an example security threat, (d) modeling dy- 
namic change, and (e) rapidly composing models to cre- 
ate higher-level behaviors. We present detailed explana- 
tion of input, the behavior model and analysis output for 
the first two cases. Due to space constraints, we briefly 
discuss the remaining three cases with their correspond- 
ing behavior models, demonstrating features of our se- 
mantic analysis framework. 


5.1 Modeling Hypothesis


Researchers frequently need to validate hypotheses or test results
presented by other researchers. We emulate one such scenario by
validating the results presented by Hussain et al. [5] to demonstrate
how behavior models can be rapidly created to reproduce results. We
also discuss the syntax involved in writing a complete behavior model.

In the above referenced paper, a threshold-based heuristic was
presented to identify DDoS attacks in traces captured at an ISP.
Attacks on a victim were identified by testing for two thresholds on
anonymized traces: (a) the number of sources that connect to the same
destination within one second exceeds 60, or (b) the traffic rate
exceeds 40,000 packets/sec. We demonstrate the advantages of behavior
model-based analysis by defining a model to test for the two heuristics
listed above using 10 seconds of the trace file containing the start of
an attack. We normalize the packet traces to 142,530 PKT_IP events.

    1.  [header]
    2.  NAMESPACE=NET.ATTACKS
    3.  NAME=DDOS_HYP
    4.  QUALIFIER={}
    5.  IMPORT=NET.BASE_PROTO.IPFLOW

    6.  [states]
    7.  sA=IPFLOW.ip_s2d()
    8.  sB=IPFLOW.ip_s2d(dip=$sA.dip)

    9.  [behavior]
    10. hyp_1=(sA)[bcount=1] ~>[<=1s] (sB)[bcount>=59]
    11. hyp_2=(sA)[rate > 40000]

    12. [model]
    13. DDOS_HYP(timestamp,sip,dip,etype)=(hyp_1 or hyp_2)

(a) DDOS_HYP models two thresholds for detecting DDoS attacks.

    Summary : DDOS_HYP_hyp_1
    Total Matching Instances: 2

    Instance : 1 of 2 (Total Event Count: 60)
    State Definition: sA
    1025390156 |201.199.184.56|87.231.216.115| PKT_ICMP
    State Definition: ~> [<= 1s ] sB [ ecount >= 59 ]
    1025390156 |201.199.184.56|87.231.216.115| PKT_ICMP
    1025390156 |201.199.184.56|87.231.216.115| PKT_ICMP
    <truncated output containing remaining 57 events>

    Instance : 2 of 2 (Total Event Count: 60)
    State Definition: sA
    1025390157 |53.232.170.113|87.134.184.48 | PKT_ICMP
    State Definition: ~> [<= 1s ] sB [ ecount >= 59 ]
    1025390157 |33.138.213.170|87.134.184.48 | PKT_ICMP
    1025390157 |33.138.213.181|87.134.184.48 | PKT_ICMP
    <truncated output containing remaining 57 events>

(b) Behavior instances satisfying the DDOS_HYP model.

Figure 4: Behavior model for confirming a hypothesis and corresponding
behavior instances from network traces satisfying the model.


Referring to the model script shown in Figure 4(a), 
lines 2—5 define the model header. Line 4 does not 
specify any qualifying conditions, that is, filters, for the 
events it can process. Line 5 imports the IPFLOW model 
from the knowledge base. Lines 7-8 define the neces- 
sary state propositions. Line 7 defines sA, a simple state 
which just captures an IP packet from some source to 
destination. Line 8 defines a state sB with a dependency 
that its dip has to be equal to the dip in sA. State sA 
thus provides a context for sB. 


Line 10 expresses the first hypothesis that there should be more than
60 sources connecting to the same destination for an attack. We apply
the ~> operator to denote that we expect sA to occur before sB. The
behavior constraint bcount (see Section 3.4) applied to sA limits the
number of events returned to 1, whereas applied to sB it requires that
at least 59 events occur after the event matching sA. Additionally, the
operator constraint [<=1s] binds sA and sB to occur within a second in
the order specified.

[Figure 5(a) (diagram, not reproduced): DNS Kaminsky experiment setup
with three nodes: realns (R), serving eby.com at 10.1.6.3; victimns (V)
at 10.1.4.2; and attacker (A) at 10.1.11.2. A sends queries for
xxxx.eby.com to V and V forwards the query to R. R responds with
NXDOMAIN and the correct dnsauth = realns.eby.com, while A sends forged
responses with the fake dnsauth = fakens.fake.com.]

(a) DNS Kaminsky experiment setup.

[Figure 5(b) (diagram, not reproduced): a tree rooted at "A sends query
to V" followed by "V forwards query to R", whose branches order the
states "R responds to V", "A sends INCORRECT FORGED responses to V" and
"A sends the CORRECT FORGED response to V".]

(b) Set of possible experiment behaviors.

    1.  [header]
    2.  NAMESPACE = NET.ATTACKS
    3.  NAME = DNSKAMINSKY
    4.  QUALIFIER = {etype='PKT_DNS'}
    5.  IMPORT = NET.APP_PROTO.DNSREQRES

    6.  [states]
    7.  # Attacker to victim query
    8.  AtoV_query = DNSREQRES.dns_req()

    9.  # Victim to real ns query
    10. VtoR_query = DNSREQRES.dns_req(sip=$AtoV_query.dip,
                         dnsquesname=$AtoV_query.dnsquesname)

    11. # Real NS to victim real response
    12. RtoV_resp = DNSREQRES.dns_res($VtoR_query,
                         dnsauth=realns.eby.com)

    13. # Attacker to victim CORRECT fake response
    14. AtoV_resp = DNSREQRES.dns_res($VtoR_query,
                         dnsauth=fakens.fake.com)[bcount>=1]

    15. # Attacker to victim INCORRECT response case
    16. AtoV_noresp = DNSREQRES.dns_res($VtoR_query,
    17.                  dnsid != $VtoR_query.dnsid)[bcount>=1]

    18. [behavior]
    19. initial_query = (AtoV_query ~> VtoR_query)
    20. b_1 = initial_query ~> RtoV_resp ~> (AtoV_resp xor AtoV_noresp)
    21. b_2 = initial_query ~> AtoV_noresp ~> RtoV_resp
    22. b_3 = initial_query ~> AtoV_resp ~> RtoV_resp

    23. [model]
    24. FAILURE(sip,dip,sport,dport,dnsid,dnsauth) = b_1 or b_2
    25. SUCCESS(sip,dip,sport,dport,dnsid,dnsauth) = b_3

(c) DNSKAMINSKY models complete experiment behavior.

Figure 5: Experiment setup, possible set of behaviors and corresponding
behavior model for validating a networked experiment.

Line 11 defines the second hypothesis, which requires that the packet
rate be > 40,000 by using the rate constraint on state proposition sA.
Lastly, line 13 defines the behavior model DDOS_HYP, which asserts that
either hyp_1 or hyp_2 or both are valid. The four attributes timestamp,
sip, dip and etype are reported in the final output.
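
For comparison, the two thresholds encoded by hyp_1 and hyp_2 can be checked procedurally. The Python below is a rough sketch and not the code of the original study: the function names and the (timestamp, sip, dip) tuple layout are hypothetical. Like the behavior model, hyp_1_victims counts packets to the same destination rather than distinct sources, and hyp_2_holds reads the rate constraint as packet count divided by trace duration, which is one possible interpretation.

```python
from collections import defaultdict


def hyp_1_victims(packets, followers=59, window=1.0):
    """Destinations where one packet is followed by >= `followers`
    packets to the same destination within `window` seconds."""
    by_dst = defaultdict(list)
    for ts, _sip, dip in packets:
        by_dst[dip].append(ts)
    victims = set()
    for dip, times in by_dst.items():
        times.sort()
        j = 0
        for i, start in enumerate(times):
            j = max(j, i)
            # Advance j to the last packet within the window of packet i.
            while j + 1 < len(times) and times[j + 1] - start <= window:
                j += 1
            if j - i >= followers:
                victims.add(dip)
                break
    return victims


def hyp_2_holds(packets, threshold=40000):
    """True if (packet count / trace duration) exceeds `threshold`."""
    times = [ts for ts, _sip, _dip in packets]
    duration = max(times) - min(times)
    return duration > 0 and len(times) / duration > threshold


# Synthetic example: 60 packets to victim "v" within 0.59 seconds.
burst = [(i * 0.01, "s%d" % i, "v") for i in range(60)]
quiet = [(0.0, "a", "w"), (5.0, "b", "w")]
```

Even this reduced version needs explicit windowing and bookkeeping that the behavior model expresses in two lines.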

When the model is applied to the packet trace, it produces the output
shown in Figure 4(b). We see that two instances are reported matching
hypothesis hyp_1, both with 60 events within a 1 second interval. The
output also shows the state or behavior definition that each group of
events matched. The two destination IPs that are under attack are
87.231.216.115 and 87.134.184.48. This output is consistent with the
findings reported in the original paper [5].

This example clearly demonstrates the ease with 
which simple hypotheses could be modeled and vali- 
dated. The original authors wrote about 2,000 lines of 
C code to identify attacks. The same validation was ex- 
pressed in about five lines as a behavior model. Addition- 
ally, this model can now be shared and easily modified 
and extended. 


5.2 Modeling Experiment Behavior 


Running experiments on a testbed, such as DETER [2], 
is challenging since it is hard to ascertain the validity of 
the experiment manually. With our framework, a model 
can be used to capture the “definition of validity” which 


includes possible successful and failed behaviors for an 
experiment and then confirmatory analysis can verify if 
it was met. Such a model can also be easily shared with 
other experimenters promoting sharing and reuse of ex- 
periments. 

We present an experiment emulating Dan Kaminsky's popular DNS attack
[7] using the metasploit [11] framework. Referring to Figure 5(a), the
attacker's objective is to poison the cache of the victimns so that any
requests to eby.com are redirected to a fake nameserver (fakens)
instead of the real nameserver (realns). We refer the reader to [7] for
a detailed understanding of the attack. Since the attack exploits a
race condition, our experiment setup has to permit successful
occurrences as well as failed occurrences of the attack.

Figure 5(b) captures the experiment behavior as a tree 
of possibilities where the nodes are the experiment states 
and the paths connecting the states are possible experi- 
ment behaviors. These states are not exhaustive but suffi- 
cient to capture most of the semantics of the experiment. 
Specifically, we see that there are three possible behav- 
iors that can lead to failures and one behavior that can 
lead to success. 

The behavior model script is shown in Figure 5(c). Lines 2-4 define the
model as DNSKAMINSKY over events of type PKT_DNS. Line 5 imports the
DNSREQRES model that already defines states and behaviors relevant to
the DNS protocol.

Lines 7-17 define five different states that are relevant to the
experiment. Line 8 defines the first DNS query from attacker to victim
and provides a context for further states. Line 10 defines a query from
the victim to the real nameserver by requiring that the source IP
address of this query be the same as the destination IP address of the
previous query and the DNS questions of both states be identical. This
makes sure that the query forwarded by the victim nameserver is the
same as the one received. Line 12 defines the response from the real
nameserver to the victim nameserver. The response is related to the
request in line 10 by using the state identifier of the query state
VtoR_query. To specifically distinguish this response from the
attacker's response, we mention the value of the dnsauth attribute that
is expected in the response. There are two cases for specifying the
attacker's response. Line 14 defines the attacker's response to be the
same as the real nameserver response, except that we mention the fake
nameserver as the value of the dnsauth attribute. Line 16 defines the
case where the attacker's response is incorrect due to a wrongly
guessed DNS transaction id. The bcount constraint specifies that any
number of responses can be matched, since the attacker can send
multiple forged responses. Attribute values not defined in the above
states default to their definitions in DNSREQRES.

[Figure 6 (truncated framework output, not reproduced): a summary for
the SUCCESS behavior listing PKT_DNS events with timestamp 1275515488,
whose dnsauth values include fakens.fakeeby.com and realns.eby.com,
followed by a summary with Total Matching Instances: 622 listing
PKT_DNS events with timestamp 1275515486, for example
    PKT_DNS | 1275515486 | 10.1.11.2 | 10.1.4.2 | 28902 | 53 | 50921]

Figure 6: Behavior instances satisfying the DNSKAMINSKY model.

Lines 19-22 specify four possible behaviors corresponding to the four
different paths in Figure 5(b). Line 20 uses the xor operator to merge
two behavior paths. The other behaviors use the ~> operator to capture
the causation between the states. Finally, the behavior model is
defined in the model section using FAILURE and SUCCESS behaviors.
Referring to Figure 5(b), we see that b_1 and b_2, where b_1 is a
composite of two behaviors, lead to FAILURE and b_3 leads to SUCCESS.
By default, the framework composes the final model by or'ing the
behaviors specified in the model section.

After running the experiment and capturing DNS packets, we normalize
the last 10,000 packets to PKT_DNS events since they contain a
successful attack along with failures representative of the rest of the
capture. The framework outputs one SUCCESS instance and 622 FAILURE
instances as shown in Figure 6.

    1. scan_A = {etype=SCAN, src=$infect_A.host, dst=$$}
    2. infect_A = {etype=INFECT, host=$scan_A.dst}
    3. single_spread = (scan_A ~> infect_A)
    4. spread_chain = (single_spread ~> spread_chain)
    5. WORMSPREAD(host) = (spread_chain)

(a) Modeling the worm infection chain over IDS alerts.

    1. IMPORT = NET.APP_PROTO.HTTP
    2. http_pkt = HTTP.HTTP_PKT(sip=$$, dip=$$)
    3. attack_event = {etype=DOSATTACK, src=$$, dst=$http_pkt.dip}
    4. http_stream_at100 = ((http_pkt)[rate=100])
    5. http_stream_below50 = ((http_pkt)[rate=0:50])
    6. attack_start = (http_stream_at100 ew[<= 5s] (attack_event))
    7. DYNAMIC_CHANGE = (attack_start ~> http_stream_below50)

(b) Modeling change in rate of packet streams.

    1. IMPORT = NET.ATTACKS.DNSKAMINSKY, NET.ATTACKS.WORMSPREAD
    2. worm_attack = WORMSPREAD.single_spread(host=$$)
    3. dns_attack = DNSKAMINSKY.SUCCESS(sip=$worm_attack.host)
    4. COMBINED_ATTACK = (worm_attack ~> (dns_attack))

(c) Modeling an attack by composing WORMSPREAD and DNSKAMINSKY models.

Figure 7: Excerpts from behavior models for (a) modeling a security
threat, (b) modeling a dynamic change and (c) composing higher-level
models. We refer the reader to the framework webpage [17] for details.


This case study demonstrates the ease with which the 
full system behavior was semantically modeled at the 
level of user’s understanding. Additionally, the model 
was composed using existing models from the knowl- 
edge base, extended with user’s context-specific values 
for attributes and then validated. 


5.3 Modeling a Security Threat


In this case study, we define a behavior model of a typical worm spread
detected by IDS alerts collected from multiple hosts. Assume a network
with IDSes on each host reporting two types of timestamped alerts: a
SCAN alert when a scan is detected by a host and an INFECT alert when
the host is found infected. Assume an event log created by normalizing
the alerts to two types of events with their corresponding attributes.
Given the event log, our objective here is to define a behavior model
to extract all possible infection chains of any length and report the
hosts involved.


We model the worm spread behavior as shown 
in Figure 7(a) in two stages; by first defining a 
single_spread behavior using events from a sin- 
gle host and then defining the spread_chain as a 
chain of related single_spread occurrences. The 
single_spread behavior, concerning a vulnerable host 
A, 1S a Sequence of two dependent and casual events: (a) a 
scan_A event with its src attribute pointing to an earlier 
infected host, followed by (b) an infect _A event with 
its host attribute the same as scan_A.dst. A worm 
spread chain (spread_chain) is then simply defined by 
a recursive occurrence of related single_spread be- 
haviors. Referring to the model, the forward-dependent 
attribute src in the definition of scan_A connects suc- 
cessive single_spread behaviors by requiring the src 
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of the next scan to be the same as the previously infected 
host. The forward-dependent attribute src 1s initialized 
automatically the first ttme single_spread 1s parsed by 
considering it to be a dynamic ($$) variable. The next 
iteration over spread_chain then uses the values as de- 
termined dynamically by single_spread. 
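
The two-stage definition can also be sketched procedurally. The Python fragment below is only an illustration and not the framework's implementation: the Event record, its field names, and the helpers single_spreads and infection_chains are assumptions introduced here. A single spread pairs a SCAN with the first later INFECT of the scanned host, and a chain grows whenever a later scan originates from the most recently infected host.

```python
from collections import namedtuple

# Hypothetical normalized event; the field names are assumptions, not the
# framework's schema. SCAN events use src/dst, INFECT events use host.
Event = namedtuple("Event", "time kind src dst host", defaults=(None, None, None))


def single_spreads(events):
    """Pair each SCAN with the first later INFECT of the scanned host."""
    events = sorted(events, key=lambda e: e.time)
    spreads = []
    for scan in (e for e in events if e.kind == "SCAN"):
        for infect in (e for e in events if e.kind == "INFECT"):
            if infect.host == scan.dst and infect.time > scan.time:
                spreads.append((scan.src, scan.dst, scan.time, infect.time))
                break
    return spreads


def infection_chains(events):
    """Return every maximal chain of hosts linked by related single spreads."""
    spreads = single_spreads(events)
    chains = []

    def extend(chain, infected_at):
        grown = False
        for src, dst, scan_time, infect_time in spreads:
            # The next scan must come from the host that was infected last.
            if src == chain[-1] and scan_time > infected_at and dst not in chain:
                extend(chain + [dst], infect_time)
                grown = True
        if not grown:
            chains.append(chain)

    for src, dst, _scan_time, infect_time in spreads:
        extend([src, dst], infect_time)
    return chains
```

For a log in which A scans B, B is infected, B scans C and C is infected, this sketch reports the chain A, B, C as well as the shorter chain B, C that starts at the second spread.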


5.4 Modeling Dynamic Change 


Dynamic changes are a fundamental characteristic of 
networked and distributed environments. One example 
of a dynamic change is the change in rate of a stream 
of packets due to an anomalous condition such as a DoS 
attack. Our objective in this case study is to model an 
expected reduction in the rate of legitimate HTTP traffic 
due to a DoS attack on a server. Our raw data consists of
IDS DoS attack alerts and HTTP packets. 


The DYNAMIC_CHANGE model, containing only the
relevant aspects, is described in Figure 7(b). Line 2 de-
fines a state capturing an HTTP packet between a source
and destination. Line 3 defines a state capturing a DoS
attack alert, additionally requiring the destination to be
the same as the destination in the HTTP packet. Lines 4 and
5 describe the HTTP packet stream rates before and af- 
ter the attack respectively. The change boundary is de- 
fined by the attack_event that is triggered once the 
attack starts. Since attack_event represents a single 
event, it has the same starttime and endtime. Line 6 uses
the ew (endswith) operator to define the attack_start
condition, which specifies that the http_stream_at100
behavior ends within five seconds of the attack_event.
The DYNAMIC_CHANGE model is then an assertion that 
the HTTP stream rate reduces following the attack. 
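
The assertion behind this model can be illustrated with a small Python sketch. It is not the model of Figure 7(b); the function names, the representation of packets as plain timestamps, and the reuse of a five-second window on both sides of the attack are assumptions made for illustration.

```python
def packet_rate(timestamps, start, end):
    """Packets per second in the half-open interval [start, end)."""
    if end <= start:
        raise ValueError("empty interval")
    count = sum(1 for t in timestamps if start <= t < end)
    return count / (end - start)


def rate_dropped(http_times, attack_time, window=5.0):
    """True if the HTTP rate after the attack is lower than the rate before it.

    The window length is an assumed parameter, not taken from the model.
    """
    before = packet_rate(http_times, attack_time - window, attack_time)
    after = packet_rate(http_times, attack_time, attack_time + window)
    return after < before
```

Here attack_time plays the role of attack_event, whose start and end times coincide, and the comparison of the two rates corresponds to the claim that the stream rate reduces once the attack begins.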


5.5 Composing Models 


Our final case study demonstrates the ease of compos- 
ing and extending existing models to define semantically 
relevant higher-level behavior. 


We combine our previously defined mod- 
els DNSKAMINSKY and WORMSPREAD to create a 
COMBINED_ATTACK scenario as shown in Figure 7(c). 
Line 2 captures the behavior where a worm infects a 
host machine and scans and infects another host. Line 
3 describes the behavior where the worm launches a 
DNS Kaminsky attack on some DNS server from the 
last infected host. We do not specify any server for the 
DNS Kaminsky attack due to the abstractness of the 
DNSKAMINSKY model which infers the destination dy- 
namically. Line 4 is the final behavior model combining 
both the attacks. In line 3, we only constrain the sip 
and leave other attributes unspecified. This demonstrates 
the ability to extend the imported models with only 
the desired attribute values while leaving the others as 
defined in the imported model. 


Figure 8: Plot of runtime against number of events for five types
of behavior complexity: b1 = cState; b2 = iState; b3 = iState ~> iState;
b4 = iState ~> dState; b5 = iState ~> dState ~> dState ~> dState.
Behaviors containing dependent value states (dStates) result in
quadratic complexity.


6 Performance Analysis 


A common approach for semantic-level analysis involves 
use of custom scripts or tools encoding context-specific 
semantics. Since custom scripts and tools can be written 
using a variety of programming and optimization tech- 
niques, any evaluation of our generic framework against 
them would be very subjective and thus flawed. Instead, 
we choose to report the raw runtime performance of our 
prototype implementation on five basic analysis tasks
over event datasets of increasing size. 

The runtime performance of the framework depends 
on the language constructs, input data, analysis algorithm 
and implementation mechanisms used. Since our pri- 
mary focus in this paper is on enabling semantic func- 
tionality, we prototyped the framework in Python using a 
SQLite database as backend for storing events. The input 
events used were PKT_DNS events collected for the case 
study in Section 5.2. The performance analysis was conducted
on a laptop with an Intel Pentium-M processor
running at 1.86 GHz and with 2 GB of memory.

We measure runtime as a function of two variables:
(a) the number of events input to the algorithm, and (b) the
behavior complexity, defined as the processing complexity
of state propositions in a behavior formula. As discussed
in Section 3.3, there are three types of state propositions
based on attribute assignments: constant value attributes,
denoted as cState; dependent value attributes, denoted as
dState; and dynamic attribute values, denoted as iState.
These states can be combined to form five basic behaviors,
each representing a basic semantic analysis task: b1 =
(cState) represents extracting events with known attributes
and values; b2 = (iState) represents extracting events with
particular attributes but unknown values; b3 = (iState ~>
iState) represents extracting causally correlated yet
value-independent events; b4 = (iState ~> dState) represents
extracting causally correlated and value-dependent events;
and b5 = (iState ~> dState ~> dState ~> dState) represents
extracting a long chain of causal events. Although we limit
our analysis to the ~> operator, all operators incur uniform
processing overhead in the algorithm, thus resulting in similar
performance results. The chosen event set along with the
behaviors are representative of a worst-case input to the
framework. We measure the performance using the above
behaviors over event sets in increments of 10,000 events.
We stop at the first event set for which runtime exceeds 60
minutes.

The results are averaged over three runs and are shown
in Figure 8. The plots for behaviors b1, b2 and b3, which
consist only of cStates and iStates, tend to be linear, as
discussed in Section 4.4. One would expect behavior b5,
containing three dStates, to show significantly higher runtime
than behavior b4, which contains only one dState. However,
both show quadratic performance, since, in a chain of
dependent states, the states further along the chain process
fewer events than the states at the front of the chain. We thus
see that runtime quickly becomes quadratic given a worst-case
set of events and behaviors containing dependent state
propositions. The current Python and SQLite-based
implementation also adds a penalty to the framework runtime.
We investigate these issues as part of our future work.
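
To see why dependent states change the growth rate, consider a deliberately naive matcher. This is a sketch for intuition only and not the framework's algorithm; the function names and the use of plain values as events are assumptions. A constant-value filter touches each event once, whereas a value-dependent match compares each event with every later event.

```python
def match_constant(events, value):
    """cState-style filter: one comparison per event."""
    comparisons = 0
    matches = []
    for event in events:
        comparisons += 1
        if event == value:
            matches.append(event)
    return matches, comparisons


def match_dependent(events):
    """iState ~> dState-style match: pair each event with every later
    event that carries the same value, comparing all ordered pairs."""
    comparisons = 0
    pairs = []
    for i, first in enumerate(events):
        for second in events[i + 1:]:
            comparisons += 1
            if second == first:
                pairs.append((first, second))
    return pairs, comparisons
```

With n events the first function performs n comparisons and the second n(n-1)/2, which mirrors the linear and quadratic shapes in Figure 8 under the worst-case assumption that no pair can be ruled out early.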


7 Conclusion and Future Work 


In this paper, we presented a behavior-based semantic 
analysis framework that allows the user to analyze data 
at a higher level of abstraction. Typically, system experts
rely on their intuition and experience to manually ana- 
lyze and categorize scenarios and then hand-craft rules 
and patterns for analysis. Due to the manual and
ad-hoc nature of this analysis process, there is limited
extensibility and composability of analysis strategies. In
this paper we show that our approach is more system- 
atic, can retain expert knowledge, and supports compos- 
ing behaviors from existing models. We evaluated the 
utility of our framework against five analysis scenarios,
which demonstrated the ease with which a user’s higher- 
level understanding of system operation was expressed 
as behavior models over data. 

Our future work includes investigating the scale and
efficiency issues that arise when processing large volumes
of data in both offline and real-time settings such as
intrusion detection. We will investigate stream-based SQL
query extensions [6] to improve performance. We will 
also investigate extending our logic with existential and 
universal quantifiers. Currently, our framework requires 
a user to either manually specify behavior models or use 
existing models from the knowledge base to explore data. 
To further exploratory analysis, we would need to alert 
users to interesting unanticipated behaviors. We are ex- 
ploring data mining algorithms to automatically discover 
and compose behavior models from data.

The fundamental goal of the behavior-based semantic
analysis framework is to introduce a semantic approach
to data analysis in networked and distributed systems re-
search and operations. We hope that this paper serves as
a catalyst for further research on semantic data analysis.


Abstract


Conventional wisdom holds that Paxos is too expensive to use for high-volume, high-throughput, data-intensive 
applications. Consequently, fault-tolerant storage systems typically rely on special hardware, semantics weaker than 
sequential consistency, a limited update interface (such as append-only), primary-backup replication schemes that 
serialize all reads through the primary, clock synchronization for correctness, or some combination thereof. We 
demonstrate that a Paxos-based replicated state machine implementing a storage service can achieve performance 
close to the limits of the underlying hardware while tolerating arbitrary machine restarts, some permanent machine 
or disk failures and a limited set of Byzantine faults. We also compare it with two versions of primary-backup. The 
replicated state machine can serve as the data store for a file system or storage array. We present a novel algorithm 
for ensuring read consistency without logging, along with a sketch of a proof of its correctness. 


1. Introduction 

Replicated State Machines (RSMs) [31, 35] provide 
desirable semantics, with operations fully serialized 
and durably committed by the time a result is re- 
turned. When implemented with Paxos [20], they 
also tolerate arbitrary computer and process restarts 
and permanent stopping faults of a minority of com- 
puters, with only very weak assumptions about the 
underlying system--essentially that it doesn’t exhibit 
Byzantine [22] behavior. Conventional wisdom 
holds that the cost of obtaining these properties is too 
high to make Paxos RSMs useful in practice for ap- 
plications that require performance. For instance, 
Birman [4] writes: 


Given that it offers stronger failure guarantees, 
why not just insist that all multicast primitives 
be dynamically uniform [his term for what 
Paxos achieves]? ... From a theory perspec- 
tive, it makes sense to do precisely this. Dy- 
namic uniformity is a simple property to formal- 
ize, and applications using a dynamically uni- 
form multicast layer are easier to prove cor- 
rect. 


But the bad news is that dynamic uniformity is 
very costly [emphasis his]. 


On the other hand, there are major systems 
(notably Paxos...) in which ... dynamic uni- 
formity is the default. ... [T]he cost is so high 
that the resulting applications may be unac- 
ceptably sluggish. 


We argue that at least in the case of systems that are 
replicated over a local area network and have opera- 
tions that often require using hard disks, this simply 
is not true. The extra message costs of Paxos over 
other replication techniques are overwhelmed by the 
roughly two orders of magnitude larger disk latency 
that occurs regardless of the replication model. Fur- 
thermore, while the operation serialization and com- 
mit-before-reply properties of Paxos RSMs seem to 
be at odds with getting good performance from disks, 
we show that a careful implementation can operate 
disks efficiently while preserving Paxos’ sequential 
consistency. Our measurements show that a Paxos 
RSM that implements a virtual disk service has per- 
formance close to the limits of the underlying hard- 
ware, and better than primary-backup for a mixed 
read-write load. 


The current state of the art involves weakened se- 
mantics, stronger assumptions about the system, re- 
stricted functionality, special hardware support or 
performance compromises. For example, the Google 
File System [13] uses append-mostly files, weakens 
data consistency and sacrifices efficiency on over- 
writes, but achieves very good performance and scale 
for appends and reads. Google’s Paxos-based imple- 
mentation [8] of the Chubby lock service [5] relies on 
clock synchronization to avoid stale reads and re-
stricts its state to fit in memory; its published perfor-
mance is about a fifth of ours¹. Storage-area network
(SAN) based disk systems often use special hardware 


¹ Though differences in hardware limit the value of
this comparison.

such as replicated battery-backed RAM to achieve
fault tolerance, and are usually much more costly 
than ordinary computers, disks and networks. There 
are a number of flavors of primary-backup replication 
[4], but typically these systems run at the slower rate 
of the primary or the median backup, and may rely on 
(often loose) clock synchronization for correctness. 
Furthermore, they typically read only from the prima- 
ry, which at worst wastes the read bandwidth of the 
backup disks and at best is unable to choose where to 
send reads at runtime, which can result in unneces- 
sary interference of writes with reads. Many Byzan- 
tine-fault tolerant (BFT) [1, 9, 18] systems do not 
commit operations to stable storage before returning 
results, and so cannot tolerate system-wide power 
failures without losing updates. In contrast, our Pax- 
os-based RSM runs on standard servers with directly 
attached disks and an ordinary Ethernet switch, 
makes no assumptions about clock synchronization to 
ensure correctness, delivers random read perfor-
mance that grows nearly linearly in the number of
replicas and random write performance that is limited 
by the performance of the disks and the size of the 
write reorder buffer, but is not affected by the dis- 
tributed parts of the system. It performs 12%-69% 
better than primary-backup replication on an online 
transaction processing load. 


The idea of an RSM is that if a computation is deter- 
ministic, then it can be made fault-tolerant by running 
copies of it on multiple computers and feeding the 
same inputs in the same order to each of the replicas. 
Paxos is responsible for assuring the sequence of 
operations. We modified the SMART [25] library 
(which uses Paxos) to provide a framework for im-
plementing RSMs. SMART stored its data in SQL
Server [10]; we replaced its store and log and made 
extensive internal changes to improve its perfor- 
mance, such as combining the Paxos log with the 
store’s log. We also invented a new protocol to order 
reads without requiring logging or relying on time for 
correctness. To differentiate the original version of 
SMART from our improved version, we refer to the 
new code as SMARTER². We describe the changes
to SMART and provide a sketch of a correctness 
proof for our read protocol. 


Disk-based storage systems have high operation la- 
tency (often >10ms without queuing delay) and per- 
form much better when they’re able to reorder re- 
quests so as to minimize the distance that the disk 
head has to travel [39]. On the face of it, this is at 
odds with the determinism requirements of an RSM: 
If two operations depend on one another, then their 


² SMART, Enhanced Revision.

order of execution will determine their result. Reor-
dering across such a dependency could in turn cause
the replicas’ states to diverge. We address this prob- 
lem by using IO parallelism both before and after the 
RSM runs, but by presenting the RSM with fully se- 
rial inputs. This is loosely analogous to how out-of- 
order processors [37] present a sequential assembly 
language model while operating internally in parallel. 


This paper presents Gaios³, a reliable data store con-
structed as an RSM using SMARTER. Gaios can be
used as a reliable disk or as a stream store (something 
like the i-node layer of a file system) that provides 
operations like create, delete, read, (over-)write, ap- 
pend, extend and truncate. We wrote a Windows 
disk driver that uses the Gaios RSM as its store, cre- 
ating a small number of large streams that store the 
data of a virtual disk. While it is beyond the scope of 
this paper, one could achieve scalability in both per- 
formance and storage capacity by running multiple 
instances of Gaios across multiple disks and nodes. 


We use both microbenchmarks and an industry-standard
online transaction processing (OLTP)
benchmark to evaluate Gaios. We compare Gaios 
both to a local, directly attached disk and to two vari- 
ants of primary-backup replication. We find that 
Gaios exposes most of the performance of the under- 
lying hardware, and that on the OLTP load it outper- 
forms even the best case version of primary-backup 
replication because SMARTER is able to direct reads 
away from nodes that are writing, resulting in less 
interference between the two. 


Section 2 describes the Paxos protocol to a level of 
detail sufficient to understand its effects on perfor- 
mance. It also describes how to use Paxos to imple- 
ment replicated state machines. Section 3 presents 
the Gaios architecture in detail, including our read 
algorithm and its proof sketch. Section 4 contains 
experimental results. Section 5 considers related 
work and the final section is a summary and conclu- 
sion. 


2. Paxos Replicated State Machines


A state machine is a deterministic computation that 
takes an input and a state and produces an output and 
a new state. Paxos is a protocol that results in an 
agreement on an order of inputs among a group of 
replicas, even when the computers in the group crash 


³ Gaios is the capital and main port on the Greek is-
land of Paxos.

and restart or when a minority of computers perma-
nently fail. By using Paxos to serialize the inputs of
a state machine, the state machine can be replicated 
by running a copy on each of a set of computers and 
feeding each copy the inputs in the order determined 
by Paxos. 
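
As a minimal sketch of this idea, the Python fragment below runs several copies of a deterministic state machine over one agreed input order. The KeyValueMachine class and the operation tuples are assumptions made for illustration; the ordered list stands in for the sequence that Paxos would determine, and no agreement protocol is implemented here.

```python
class KeyValueMachine:
    """A deterministic state machine: (state, input) -> (new state, output)."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "write":
            self.state[key] = value
            return "ok"
        if kind == "read":
            return self.state.get(key)
        raise ValueError(f"unknown operation {kind!r}")


def run_replicas(ordered_ops, replica_count=3):
    """Feed the same inputs in the same order to every replica."""
    replicas = [KeyValueMachine() for _ in range(replica_count)]
    outputs = [[r.apply(op) for op in ordered_ops] for r in replicas]
    return replicas, outputs
```

Because apply depends only on the current state and the input, every replica ends in the same state and produces the same outputs, so any one of them can answer the client.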


This section describes the Paxos protocol in sufficient 
detail to understand its performance implications. It 
does not attempt to be a full description, and in par- 
ticular gives short shrift to the view change algo- 
rithm, which is by far the most interesting part of 
Paxos. Because view change happens only rarely and 
is inexpensive when it does, it does not have a large 
effect on overall system performance. Other papers 
[20, 21, 23] provide more in-depth descriptions of 
Paxos. 


2.1 The Paxos Protocol 

As SMART uses it, Paxos binds requests that come 
from clients to slots. Slots are sequentially num-
bered, starting with 1. A state machine will execute
the request in slot 1, followed by that in slot 2, etc. 
When thinking about how SMART works, it is help- 
ful to think about two separate, interacting pieces: 
the Agreement Engine and the Execution Engine. 
The Agreement Engine uses Paxos to agree on an 
operation sequence, but does not depend on the state
machine’s state. The Execution Engine consumes the 
agreed-upon sequence of operations, updates the state 
and produces replies. The Execution Engine does not 
depend on a quorum algorithm because its input is 
already linearized by the Agreement Engine. 
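
The division of labor can be illustrated with a small Python sketch of an execution engine that applies committed requests strictly in slot order. The class and method names are assumptions for illustration and are not SMART's interfaces; the apply callback stands in for the state machine.

```python
class ExecutionEngine:
    """Consumes committed (slot, request) pairs and executes them in order."""

    def __init__(self, apply):
        self.apply = apply      # state-machine callback (assumed interface)
        self.next_slot = 1      # slots are numbered sequentially from 1
        self.pending = {}       # committed requests waiting for earlier slots
        self.replies = {}       # slot -> reply produced by the state machine

    def on_commit(self, slot, request):
        self.pending[slot] = request
        # Execute as far as the contiguous prefix of committed slots allows.
        while self.next_slot in self.pending:
            req = self.pending.pop(self.next_slot)
            self.replies[self.next_slot] = self.apply(req)
            self.next_slot += 1
```

If the commit for slot 2 arrives before the commit for slot 1, the engine holds it back and executes both once slot 1 is known, so no quorum logic is needed at this layer.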


The protocol attempts to have a single computer des- 
ignated as leader at any one time, although it never 
errs regardless of how many computers simultane- 
ously believe they are leader. We will ignore the 
possibility that there is not exactly one leader at any 
time (except in the read-only protocol proof sketch in 
Section 3.3.2) and refer to the leader, understanding 
that this is a simplification. Changing leaders (usually 
in response to a slow or failed machine) is called a 
view change. View changes are relatively light- 
weight; consequently, we set the view change 
timeout in SMART to be about 750ms and accept 
unnecessary view changes so that when the leader 
fails, the system doesn’t have to be unresponsive for 
very long. By contrast, primary-backup replication 
algorithms often have to wait for a lease to expire 
before they can complete a view change. In order to 
assure correctness, the lease timeout must be greater 
than the maximum clock skew between the nodes. 


Figure 1 shows the usual message sequence for a 
Paxos read/write operation, leaving out the computa-
tion and disk IO delays. When a client wants to
submit a read/write request, it sends the request to the 
leader (getting redirected if it’s wrong about the cur- 
rent leader). The leader receives the request, selects 
the lowest unused slot number and sends a proposal 
to the computers in the Paxos group, tentatively bind- 
ing the request to the slot. The computers that re- 
ceive the proposal write it to stable storage and then 
acknowledge the proposal back to the leader. When 
more than half of the computers in the group have 
written the proposal (regardless of whether the leader 
is among the set), it is permanently bound to the slot. 
The leader then informs the group members that the 
proposal has been decided with a commit message. 
The Execution Engines on the replicas process com- 
mitted requests in slot number order as they become 
available, updating their state and generating a reply 
for the client. It is only necessary for one of them to 
send a reply, but it is permissible for several or all of 
them to reply. The dotted lines on the reply messages 
in Figure 1 indicate that only one of them is neces- 
sary. 
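
The rule that decides a proposal can be sketched as follows. This is an illustrative fragment, not SMART's code: the Proposal class and the replica identifiers are assumptions, and logging, networking and view changes are omitted.

```python
class Proposal:
    """Tracks which group members have written a proposal to stable storage."""

    def __init__(self, slot, request, group_size):
        self.slot = slot
        self.request = request
        self.group_size = group_size
        self.acks = set()

    def ack(self, replica_id):
        """Record an acknowledgement; return True once the slot is bound."""
        self.acks.add(replica_id)
        return self.decided

    @property
    def decided(self):
        # More than half of the group, whether or not the leader is included.
        return 2 * len(self.acks) > self.group_size
```

In a group of three, acknowledgements from the two followers are enough to bind the request to its slot even if the leader's own log write has not finished, after which the leader can send the commit message.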


Figure 1: Read/Write Message Sequence (participants: Client,
Leader, and two Followers)


When the write to stable storage is done using a disk 
and the network is local, the disk write is the most 
expensive step by a large margin. Disk operations 
take milliseconds or even tens of milliseconds, while 
network messages take tens to several hundred mi- 
croseconds. This observation led us to create an al- 
gorithm for read-only requests that avoids the logging 
step but uses the same number of network messages. 
It is described in Section 3.3.2.


2.2 Implementing a Replicated State Machine with Paxos


There are a number of complications in building an 
efficient replicated state machine, among them avoid- 
ing writing the state to disk on every operation. 
SMART and Google’s later Paxos implementation 
[8] solve this problem by using periodic atomic 
checkpoints of the state. SMART (unlike Google) 
writes out only the changed part of the state. If a
node crashes other than immediately after a check-
point, it will roll back its state and re-execute opera-
tions, which is harmless because the operations are
deterministic. Both implementations also provide for 
catching up a replica by copying state from another, 
but that has no performance implication in normal 
operation and so is beyond the scope of this paper. 


3. Architecture 

SMARTER is at the heart of the Gaios system as 
shown in Figure 2. It is responsible for the Paxos 
protocol and overall control of the work flow in the 
system. One way to think of what SMARTER does 
is that it implements an asynchronous Remote Proce- 
dure Call (RPC) where the server (the state machine) 
runs on a fault-tolerant, replicated system. 





Figure 2: Gaios Architecture (client side: Application, NTFS,
Gaios Disk Driver and SMARTER Client, spanning user and
kernel mode; connected over the Network to the server side:
SMARTER Server, Gaios, Stream Store and Log)


Gaios’s state machine implements a stream store. 
Streams are named by 128-bit Globally Unique IDs 
(GUIDs) and consist of a sparse array of bytes. The
interface includes create, delete, read, write, and 
truncate. Reads and writes may be for a portion of a 
stream and include checksums of the stream data. 


SMARTER uses a custom log to record Paxos pro- 
posals and the Local Stream Store (LSS) to hold state 
machine state and SMARTER’s internal state. The 
system has two clients, one a user-mode library that 
exposes the functions of the Gaios RSM and the
second a kernel-mode disk driver that presents a logical
disk to Windows and backs the disk with streams
stored in the Gaios RSM.


3.1 SMARTER 


Among the changes we made to SMART⁴ were to
present a pluggable interface for storage and log pro- 
viders, rather than having SQL Server hardwired for 
both functions; to have a zero-copy data path; to al- 
low IO prefetching at proposal time; to batch client 
operations; to have a parallel network transport and
deal with the frequent message reorderings that this
produces; to detect and handle some hardware errors
and non-determinism; and to have a more efficient 
protocol for read-only requests. SMARTER per- 
forms the basic Paxos functions: client, leadership, 
interacting with the logging subsystem and RSM, 
feeding committed operations to the RSM, and man- 
aging the RSM state and sending replies to the client. 
It is also responsible for other functions such as view 
change, state transfer, log trimming, etc.


The SMARTER client pipelines and batches requests. 
Pipelining means that it can allow multiple requests 
to be outstanding simultaneously. In the implementa- 
tion measured in this paper, the maximum pipeline 
depth is set to 6, although we don’t believe that our 
results are particularly sensitive to the value. Batch- 
ing means that when there are client requests waiting 
for a free pipeline slot, SMARTER may combine 
several of them into a single composite request. 
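
The interaction of pipelining and batching can be sketched as below. This is an illustration under stated assumptions rather than the SMARTER client: the class name, the send callback, and the max_batch limit are hypothetical, and only the pipeline depth of 6 comes from the text.

```python
from collections import deque


class PipeliningClient:
    """Keeps up to max_depth requests outstanding and batches the overflow."""

    def __init__(self, send, max_depth=6, max_batch=8):
        self.send = send            # callback that transmits one batch
        self.max_depth = max_depth  # pipeline depth (6 in the measured setup)
        self.max_batch = max_batch  # assumed cap on a composite request
        self.in_flight = 0
        self.waiting = deque()

    def submit(self, request):
        self.waiting.append(request)
        self._pump()

    def on_reply(self):
        """Called when one outstanding (possibly composite) request completes."""
        self.in_flight -= 1
        self._pump()

    def _pump(self):
        while self.waiting and self.in_flight < self.max_depth:
            batch = []
            while self.waiting and len(batch) < self.max_batch:
                batch.append(self.waiting.popleft())
            self.in_flight += 1
            self.send(batch)
```

While a pipeline slot is free each request is sent on its own; once the pipeline is full, requests accumulate and leave together as one composite request when a slot frees up.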


Unlike in primary-backup replication systems, 
SMART does not require that the leader be among 
the majority that has logged the proposal; any majori- 
ty will do. This allows the system to run at the speed 
of the median member (for odd sized configurations). 
Furthermore, there is no requirement that the majori- 
ty set for different operations be the same. Neverthe- 
less all Execution Engines will see the same binding 
of operations to slots and all replicas will have identi- 
cal state at a given slot number. 
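
The effect of accepting any majority can be seen in a short calculation. The function below is a sketch with assumed names; it takes the time each group member needs to log a proposal and returns the time at which a majority has done so.

```python
def commit_latency(log_latencies):
    """Time until more than half of the group has logged the proposal."""
    majority = len(log_latencies) // 2 + 1
    return sorted(log_latencies)[majority - 1]
```

For three members with logging times of 5, 40 and 9 ms, the proposal is decided after 9 ms, the median, no matter which of the three is the leader.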


The leader’s network bandwidth could become a bot- 
tleneck when request messages are large. In this case 
SMARTER forwards the propose messages in a chain 
rather than sending them directly as shown in Figure 
1. Because the sequential access bandwidth of a disk 
is comparable to the bandwidth of a gigabit Ethernet 
link, this optimization is often important.


⁴ When we refer to “SMART” in the text, we mean
either the original system or a part of SMARTER
that is identical to it.

3.2 The Local Stream Store

Gaios uses a custom store called the Local Stream 
Store for its data (but not for its log). The LSS in 
turn uses a single, large file in NTFS against which it 
runs non-cached IO. 


The LSS writes in a batch mode. It takes requests, 
executes them in memory, and then upon request 
atomically checkpoints its entire state. The LSS is 
designed so that it can overlap (in-memory) operation 
execution with most of the process of writing the 
checkpoint to disk, so there is only a brief pause in 
execution when a checkpoint is initiated. 


The LSS maintains checksums for all stream data. 
The checksum algorithm is selectable; we used 
CRC32 [17] for all experiments in this paper, result- 
ing in 4 bytes of checksum for 4K of data, or 0.1% 
overhead. The checksums are stored separately from 
the data so that all accesses to data and its associated 
checksum happen in separate disk IOs. This is im-
portant in the case that the disk misdirects a read or
write, or leaves a write unimplemented [3]. No sin- 
gle misdirected or unimplemented IO will undetecta- 
bly corrupt the LSS. Checksums are stored near each 
other and are read in batches, so few seeks are needed 
to read and write the checksums. 
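
The overhead figure and the detection property can be reproduced with a few lines of Python. This is a sketch, not the LSS code: it uses the CRC-32 from Python's zlib module as a stand-in for the checksum cited above, keeps the checksums in a separate list to mirror their separate on-disk location, and the function names are assumptions.

```python
import zlib

BLOCK_SIZE = 4096     # 4K of stream data per checksum
CHECKSUM_SIZE = 4     # a CRC32 value occupies 4 bytes


def block_checksums(data):
    """Compute one CRC32 per 4K block; stored apart from the data itself."""
    return [zlib.crc32(data[off:off + BLOCK_SIZE])
            for off in range(0, len(data), BLOCK_SIZE)]


def verify_block(data, index, checksums):
    """Check a block read back from disk against its stored checksum."""
    block = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    return zlib.crc32(block) == checksums[index]


overhead = CHECKSUM_SIZE / BLOCK_SIZE   # 4 / 4096, roughly 0.1%
```

Because the checksum travels in a different IO from the data it covers, a single misdirected or dropped write leaves the two out of agreement, and verify_block reports the mismatch.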


The LSS provides deterministic free space. Regard- 
less of the order in which IOs complete and when and 
how often the store is checkpointed, as long as the set 
of requests is the same the system will report the 
same amount of free space. This is important for 
RSM determinism, and would be a real obstacle with 
a store like NTFS [28] that is subject to space use by 
external components and in any case is not determin- 
istic in free space. 


3.2.1 Minimizing Data Copies 

Because SMART used SQL Server as its store, it 
wrote each operation to the disk four times. When 
logging, it wrote a proposed operation into a table 
and then committed the transaction. This resulted in 
two writes to the disk: one into SQL’s transaction log 
and a second one to the table. The state machine 
state was also stored in a set of SQL tables, so any 
changes to the state because of the operation were 
likewise written to the disk twice. 


For a service that had a low volume of operations this 
wasn’t a big concern. However, for a storage service 
that needs to handle data rates comparable to a disk’s 
100 MB/s it can be a performance limitation. Elimi- 
nating one of the four copies was easy: We imple- 
mented the proposal store as a log rather than a table. 


Once the extra write in the proposal phase was gone, 
we were left with the proposal log, the transaction log 
for the final location and the write into the final loca- 
tion. We combined the proposal log and the transac- 
tion log into a single copy of the data, but it required 
careful thinking to get it right. Just because an opera- 
tion is proposed does not mean that it will be execut- 
ed; there could be a view change and the proposal 
may never get quorum. Furthermore, RSMs are not 
required to write any data that comes in an opera- 
tion—they can process it in any way they want, for 
example maintaining counters or storing indices, so 
it’s not possible to get rid of the LSS’s transaction 
log entirely. 


We modified the transaction log for the LSS to allow 
it to contain pointers into the proposal log. When the 
LSS executes a write of data that was already in the 
proposal log, it uses a special kind of transaction log 
record that references the proposal log and modifies 
the proposal log truncation logic accordingly. The 
necessity for the store to see the proposal log writes 
is why it’s shown as interposing between SMARTER 
and the log in Figure 2. In practice in Gaios data is 
written twice, to the proposal log and to the LSS’s 
store. 
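
One way to picture the reference records is the Python sketch below. It is an assumption-laden illustration, not the Gaios implementation: the class names, the record layout, and in particular the pinning rule used for truncation are our guesses at what referencing the proposal log implies.

```python
class ProposalLog:
    def __init__(self):
        self.entries = {}    # slot -> payload bytes
        self.pinned = set()  # slots still referenced by transaction records

    def append(self, slot, payload):
        self.entries[slot] = payload

    def truncate_through(self, slot):
        """Drop entries up to slot, keeping any that are still referenced."""
        for s in [s for s in self.entries if s <= slot and s not in self.pinned]:
            del self.entries[s]


class TransactionLog:
    def __init__(self, proposal_log):
        self.proposal_log = proposal_log
        self.records = []

    def log_write(self, stream, offset, slot):
        """Record a write by pointing at the proposal log instead of copying."""
        self.proposal_log.pinned.add(slot)
        self.records.append(("ref", stream, offset, slot))

    def checkpoint(self):
        """Once data is in its final location the references can be dropped."""
        for _kind, _stream, _offset, slot in self.records:
            self.proposal_log.pinned.discard(slot)
        self.records.clear()
```

Under this scheme the data bytes exist once in the logs and once in the store, which matches the two writes described above.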


It would be possible to build a system that has a sin- 
gle-write data path. Doing this, however, runs into a 
problem: Systems that do atomic updates need to 
have a copy of either the old or new data at all times 
so that an interrupted update can roll forward or 
backward [14]. This means that, in practice, single- 
write systems need to use a write-to-new store rather 
than an overwriting store. Because we wanted Gaios 
efficiently to support database loads, and because 
databases often optimize the on-disk layout assuming 
it is in-order, we chose not to build a single-write 
system. This choice has nothing to do with the repli-
cation algorithm (or, in fact, SMARTER). If we re-
placed the LSS with a log-structured or another
write-to-new store we could have a single-write path. 


3.3 Disk-Efficient Request Processing

State machines are defined in terms of handling a 
single operation at a time. Disks work best when 
they are presented with a number of simultaneous 
requests and can reorder them to minimize disk arm 
movements, using something like the elevator 
(SCAN) algorithm [12] to reduce overall time. Rec- 
onciling these requirements is the essence of getting 
performance from a state-machine based data store 
that is backed by disks.

Gaios solves this problem differently for read-only
and read-write requests. Read-write requests do their 
writes exclusively into in-memory cache, which is 
cleaned in large chunks at checkpoint time in a disk- 
efficient order. Read-only requests (ordinarily) run on 
only one replica. As they arrive, they are reordered 
and sent to the disk in a disk efficient manner, and 
are executed once the disk read has completed in 
whatever order the reads complete. 
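
To make the reordering idea concrete, here is a minimal sketch of elevator (SCAN) ordering for a batch of pending reads. It is illustrative only: the function names, the use of integer block addresses, and the single upward sweep are assumptions for the example, not a description of the Gaios scheduler.

```python
def scan_order(pending_blocks, head):
    """Order pending block addresses as one upward sweep from the head
    position, followed by a downward sweep over what was left behind."""
    ahead = sorted(b for b in pending_blocks if b >= head)
    behind = sorted((b for b in pending_blocks if b < head), reverse=True)
    return ahead + behind


def seek_distance(order, head):
    """Total arm travel (in block units) needed to serve requests in order."""
    total = 0
    for block in order:
        total += abs(block - head)
        head = block
    return total


arrival_order = [95, 10, 50, 70, 20]   # hypothetical pending reads
reordered = scan_order(arrival_order, head=40)
```

In this toy batch, serving the requests in arrival order moves the arm 250 block units, while the SCAN order moves it 140.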


3.3.1 Read-Write Processing 

SMART’s handling of read-write requests is in some 
ways analogous to how databases implement transac- 
tions [14]. The programming model for a state ma- 
chine is ACID (atomic, consistent, isolated and dura- 
ble), while the system handles the work necessary to 
operate the disk efficiently. In both, atomicity is 
achieved by logging requests, and durability by wait- 
ing for the log writes to complete before replying to 
the user. In both, the system retires writes to the non- 
log portion of the disk efficiently, and trims the log 
after these updates complete. 


Unlike databases, however, SMART achieves isola- 
tion and consistency by executing only one request at 
a time in the state machine. This has two benefits: It 
ensures determinism across multiple replicas; and, it 
removes the need to take locks during execution. 
The price is that if two read-write operations are in- 
dependent of one another, they still have to execute 
in the predetermined order, even if the earlier one has 
to block waiting for IO and the later one does not. 


SMARTER exports an interface to the state machine 
that allows it to inspect an operation prior to execu- 
tion, and to initiate any cache prefetches that might 
help its eventual execution. SMARTER calls this 
interface when it first receives a propose message. 
This allows the local store to overlap its prefetch with 
logging, waiting for quorum and any other operations 
serialized before the proposed operation. It is possi- 
ble that a proposed operation may never reach quor- 
um and so may never be executed. Since prefetches 
do not affect the system state (just what is in the 
cache), incorrect prefetches are harmless. 


During operation execution, any reads in read/write 
operations are likely to hit in cache because they’ve 
been prefetched. Writes are always applied in 
memory. Ordinarily writes will not block, but if the 
system has too much dirty memory SMARTER will 
throttle writes until the dirty memory size is suffi- 
ciently small. The local stream store releases dirty 
memory as it is written out to the disk rather than 
waiting until the end of a flush, so write throttling 
does not result in a large amount of jitter. 


3.3.2 Read-Only Processing 

SMARTER uses five techniques to improve read- 
only performance: It executes a particular read-only 
operation on only one replica; it uses a novel agree- 
ment protocol that does not require logging; it reor- 
ders the reads into a disk-efficient schedule, subject 
to ordering constraints to maintain consistency; it 
spreads the reads among the replicas to leverage all 
of the disk arms; and, it tries to direct reads away 
from replicas whose LSS is writing a checkpoint, so 
that reads aren’t stuck behind a queue of writes. 


Since a client needs only a single reply to an opera- 
tion and read-only operations do not update state 
there is no reason to execute them on all replicas. 
Instead, the leader spreads the read-only requests 
across the (live), non-checkpointing replicas using a 
round-robin algorithm. By spreading the requests 
across the replicas, it shares the load on the network 
adapters and more importantly on the disk arms. For 
random read loads where the limiting factor is the 
rate at which the disk arms are able to move there is a 
slightly less than linear speedup in performance as 
more replicas are added (see Section 4). It is sub- 
linear because spreading the reads over more drives 
reduces read density and so results in longer seeks. 
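
The dispatch policy described above can be sketched as a round-robin over replicas that skips dead or checkpointing nodes. This is a hedged illustration: the class name `ReadDispatcher`, its method, and the fallback to any live replica when every live replica is checkpointing are assumptions made for the example.

```python
class ReadDispatcher:
    """Round-robin read dispatch that prefers live, non-checkpointing replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next = 0  # index at which the next search starts

    def pick(self, live, checkpointing):
        """Return the next replica to read from, or None if none is live."""
        eligible = live - checkpointing
        # Assumed fallback: if every live replica is checkpointing,
        # reads go to a checkpointing replica rather than being refused.
        candidates = eligible if eligible else live
        n = len(self.replicas)
        for i in range(n):
            replica = self.replicas[(self._next + i) % n]
            if replica in candidates:
                self._next = (self._next + i + 1) % n
                return replica
        return None


dispatcher = ReadDispatcher(["a", "b", "c"])
everyone = {"a", "b", "c"}
picks = [dispatcher.pick(everyone, {"b"}) for _ in range(3)]
fallback = dispatcher.pick(everyone, everyone)
```

With replica "b" checkpointing, successive reads alternate between "a" and "c"; once all replicas are checkpointing, the rotation simply continues over the live set.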


When a load contains a mix of reads and writes, they 
will contend for the disk arm. It is usually the case 
that on the data disk reads are more important than 
writes because SMARTER acknowledges writes after 
they’ve been logged and executed, but before they’ve 
been written to the data disk by an LSS checkpoint. 
Because checkpoints operate over a large number of 
writes it is common for them to have more sequenti- 
ality than reads, and so disk scheduling will starve 
reads in favor of writes. SMARTER takes two steps 
to alleviate this problem: It tries to direct reads away 
from replicas that are processing checkpoints, and 
when it fails to do that it suspends the checkpoint 
writes when reads are outstanding (unless the system 
is starving for memory, in which case it lets the reads 
fend for themselves). The leader is able to direct 
reads away from checkpointing replicas because the 
replicas report whether they’re in checkpoint both in 
their periodic status messages, and also in the 
MY_VIEW_IS message in the read-only protocol, 
described immediately hereafter. 


A more interesting property of read-only operations 
is that to be consistent as seen by the clients, they do 
not need to execute in precise order with respect to 
the read/write operations. All that’s necessary is that 
they execute after any read/write operation that has 
completed before the read-only request was issued. 
That is, the state against which the read is run must 
reflect any operation that any client has seen as com- 
pleted, but may or may not reflect any subsequent 
writes. 


SMARTER’s read-only protocol is as follows: 


1. Upon receipt of a read-only request by a 
leader, stamp it with the greater of the high- 
est operation number that the leader has 
committed in sequence and the highest oper- 
ation number that the leader re-proposed 
when it started its view. 

2. Send a WHATS_MY_VIEW message to all 
replicas, checking whether they have recog- 
nized a new leader. 

3. Wait for at least half of all replicas (includ- 
ing itself) to reply that they still recognize 
the leader; if any do not, discard the read- 
only request. 

4. Dispatch the read-only request to a replica, 
including the slot number recorded in step 1. 

5. The selected replica waits for the stamped 
slot number to execute, and then checks to 
see if a new configuration has been chosen. 
If so, it discards the request. Otherwise, it 
executes it and sends the reply to the client. 
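
The five steps above can be sketched as two small functions, one for the leader (steps 1 to 4) and one for the selected replica (step 5). This is a simplified, single-process illustration under stated assumptions: the names `leader_dispatch`, `replica_read`, `DISCARD` and `WAIT` are hypothetical, view replies are modeled as a plain list of leader identifiers, and the configuration-change check in step 5 is omitted.

```python
DISCARD = "discard"   # the request is dropped
WAIT = "wait"         # not enough information yet; try again later


def leader_dispatch(leader_id, committed_slot, reproposed_slot,
                    view_replies, group_size):
    """Steps 1-4: stamp the request, run the view check, and return the
    slot number to send along with the read (or DISCARD / WAIT)."""
    # Step 1: the greater of the highest in-sequence committed slot and
    # the highest slot re-proposed when this leader started its view.
    stamp = max(committed_slot, reproposed_slot)
    # Steps 2-3: any reply naming a different leader kills the request.
    if any(view != leader_id for view in view_replies):
        return DISCARD
    # Step 3: at least half of all replicas (the leader's own reply is
    # assumed to be included in view_replies) must still recognize it.
    if 2 * len(view_replies) < group_size:
        return WAIT
    return stamp


def replica_read(executed_slot, state, stamp, key):
    """Step 5 (without the configuration check): run the read only once
    the stamped slot has executed locally."""
    if executed_slot < stamp:
        return WAIT
    return state.get(key)
```

For example, a leader "L1" with committed slot 7 and replies ["L1", "L1"] in a three-node group stamps the read with slot 7; a replica that has executed only through slot 6 must wait before answering.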


In practice, SMARTER limits the traffic generated in 
steps 2 & 3 by only having one view check outstand- 
ing at a time, and batching all requests that arrive 
during a given view check to create a single subse- 
quent view check. We’ll ignore this for purposes of 
the proof sketch, however. 


SMARTER’s read-only protocol achieves the follow- 
ing property: The state returned by a read-only re- 
quest reflects the updates made by any writes for 
which any client is aware of a completion at the time 
the read is sent, and does not depend on clock syn- 
chronization among any computers in the system. In 
other words, the reads are never stale, even with an 
asynchronous network. 


We do not provide a full correctness proof for lack of 
space. Instead we sketch it; in particular, we ignore 
the possibility of a configuration change (a change in 
the set of nodes implementing the state machine), 
though we claim the protocol is correct even with 
configuration changes. 


Proof sketch: Consider a read-only request R sent by 
a client. Let any write operation W be given such 
that W has been completed to some client before R is 
sent. Because W has completed to a client, it must 
have been executed by a replica. Because replicas 
execute all operations in order and only after they’ve 
been committed, W and all earlier operations must 
have been committed before R was sent. W was ei- 
ther first committed by the leader to which R is sent 
(call it L), or by a previous or subsequent leader (ac- 
cording to the total order on the Paxos view ID). If it 
was first committed by a previous leader, then by the 
Paxos view change algorithm L saw it as committed 
or re-proposed it when L started; if W was first 
committed by L then L was aware of it. In either 
case, the slot number in step 1 is greater than or equal 
to W’s slot number. 


If W was first committed by a subsequent leader to 
L, then the subsequent leader must have existed by 
the time L received the request in step 1, because by 
hypothesis W had executed before R was sent. If 
that is the case, then by the Paxos view change algo- 
rithm a majority of computers in the group must have 
responded to the new view. At least one of these 
computers must have been in the set responding in 
step 3, which would cause R to be dropped. So, if R 
completes then W was not first committed by a lead- 
er subsequent to L. Therefore, if R is not discarded 
the slot number selected in step 1 is greater than or 
equal to W’s slot number. 


In step 5, the replica executing R waits until the slot 
number from step 1 executes. Since W has a slot 
number less than or equal to that slot number, W 
executes before R. Because W was an arbitrary write 
that completed before R was started SMARTER’s 
read-only protocol achieves the desired consistency 
property with respect to writes. The protocol did not 
refer to clocks and so does not depend on clock syn- 
chronization. ∎ 


3.4 Non-Determinism 

The RSM model assumes that the state machines are 
deterministic, which implies that the state machine 
code must avoid things like relying on wall clock 
time. However, there are sources of non-determinism 
other than coding errors in the RSM. Ordinary pro- 
gramming issues like memory allocation failures as 
well as hardware faults such as detected or undetect- 
ed data corruptions in the disk [3], network, or 
memory systems [30, 36] can cause replicas to mis- 
behave and diverge. 


Divergent RSMs can lead to inconsistencies exposed 
to the user of the system. These problems are a sub- 
set of the general class of Byzantine faults [22], and 
could be handled by using a Byzantine-fault-tolerant 
replication system [7]. However, such systems re- 
quire more nodes to tolerate a given number of faults 
(at least 3f+1 nodes for f faults, as opposed to 2f+1 
for Paxos [26]), and also use more network commu- 
nication. We have chosen instead to anticipate a set 
of common Byzantine faults, detect them and turn 
them into either harmless system restarts or to stop- 
ping failures. The efficacy of this technique depends 
on how well we anticipate the classes of failures as 
well as our ability to detect and handle them. It also 
relies on external security measures to prevent male- 
factors from compromising the machines running the 
service (which we assume and do not discuss fur- 
ther). 


Memory allocation failures are a source of nondeter- 
minism. Rather than trying to force all replicas to fail 
allocations deterministically, SMART simply induces 
a process exit and restart, which leverages the fault 
tolerance to handle the entire range of allocation 
problems. 


In most cases, network data corruptions are fairly 
straightforward to handle. SMARTER verifies the 
integrity of a message when it arrives, and drops it if 
it fails the test. Since Paxos is designed to handle 
lost messages this may result in a timeout and retry of 
the original (presumably uncorrupted) message send. 
In a system with fewer than f failed components, 
many messages are redundant and so do not even 
require a retransmission. As long as network corrup- 
tions are rare, message drops have little performance 
impact. As an optimization, SMARTER does not 
compute checksums over the data portion of a client 
request or proposal message. Instead, it calls the 
RSM to verify the integrity of these messages. If the 
RSM maintains checksums to be stored along with 
the data on disk (as does Gaios), then it can use these 
checksums and save the expense of having them 
computed, transported and then discarded by the 
lower-level SMARTER code. 
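
The verify-on-arrival behavior can be illustrated with a short sketch. It assumes a CRC-32 checksum prepended to each message; the paper does not say which checksum SMARTER uses, and the function names `seal` and `open_or_drop` are hypothetical.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prepend a 4-byte CRC-32 of the payload (assumed wire format)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def open_or_drop(message: bytes):
    """Return the payload if its checksum verifies, otherwise None.
    Returning None models dropping the message and relying on the
    protocol's normal timeout-and-retry path."""
    if len(message) < 4:
        return None
    expected = int.from_bytes(message[:4], "big")
    payload = message[4:]
    return payload if zlib.crc32(payload) == expected else None


good = seal(b"propose slot 42")
# Flip one bit in the payload to simulate corruption in transit.
bad = good[:-1] + bytes([good[-1] ^ 0x01])
```

Here `open_or_drop(good)` returns the original payload, while `open_or_drop(bad)` returns `None`, so the corrupted message is treated exactly like a lost one.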


Data corruptions on disk are detected either by the 
disk itself or by the LSS’s checksum facility as de- 
scribed in Section 3.2. SMARTER handles a detect- 
ed, uncorrectable error by retrying it and if that fails 
declaring a permanent failure of a replica and re- 
building it by changing the configuration of the 
group. See the SMART paper [25] for details of con- 
figuration change. 


In-memory corruptions can result in a multitude of 
problems, and Gaios deals with a subset of them by 
converting them into process restarts. Because Gaios 
is a store, most of its memory holds the contents of 
the store, either in the form of in-process write re- 
quests or of cache. Therefore, we expect at least 
those memory corruptions that are due to hardware 
faults to be more likely to affect the store contents 
than program state. These corruptions will be detect- 
ed as the corrupted data fails verification on the disk 
and/or network paths. 


4. Experiments 

We ran experiments to compare Gaios to three differ- 
ent alternatives: a locally attached disk and two ver- 
sions of primary-backup replication. We ran micro- 
benchmarks to tease out the performance differences 
for specific homogeneous loads and an industry 
standard online transaction processing benchmark to 
show a more realistic mixed read/write load. We 
found that SMARTER’s ability to vector reads away 
from checkpointing (writing) replicas conveyed a 
performance advantage over primary-backup replica- 
tion. 


4.1 Hardware Configuration 

We ran experiments on a set of computers connected 
by a Cisco Catalyst 3560G gigabit Ethernet switch. 
The switch bandwidth is large enough that it was not 
a factor in any of the tests. 


The computers had three hardware configurations. 
Three computers (“old servers”) had 2 dual core 
AMD Opteron 2216 processors running at 2.4 GHz, 8 
GB of DRAM, four Western Digital WD7500AYYS 
7200 RPM disk drives (as well as a boot drive not 
used during the tests), and a dual port NVIDIA 
nForce network adapter, with both ports connected to 
the same switch. A fourth (“client”) had the same 
hardware configuration except that it had two quad- 
core AMD Opteron 2350 processors running at 2.0 
GHz. The remaining two (“new servers”) had 2 
quad-core AMD Opteron 2382 2.6 GHz processors, 
16 GB of DRAM, four Western Digital 
WD1002FBYS 7200 RPM 1 TB disk drives, and two 
dual port Intel gigabit Ethernet adapters. All of the 
machines ran Windows Server 2008 R2, Enterprise 
Edition. We ran the servers with a 128 MB memory 
cache and a dirty memory limit of 512 MB. We used 
such artificially low limits so that we could hit full- 
cache more quickly so that our tests didn’t take as 
long to run, and so that read-cache hits didn’t have a 
large effect on our microbenchmarks. 


4.2 Simulating Primary-Backup 

In order to compare Gaios to a primary-backup (P-B) 
replication system, we modified SMARTER in three 
ways: 


1. Reads are dispatched without the quorum 
check in the SMARTER read protocol, on 
the assumption that a leasing mechanism 
would accomplish the same thing without 
the messages. 

2. Read/Write operation quorums must include 
the leader, so for example in a 3-node con- 
figuration if the two non-leader nodes finish 
their logging first the system will still wait 
for the leader. 

3. All read/write replies come only from the 
leader. 


Because we didn’t implement a leasing mechanism, 
the modified SMARTER might serve stale reads after 
a view change. We simply ignored this possibility 
for performance testing. 


Because P-B systems read only from the primary, 
they cannot take advantage of the random read per- 
formance of their backup nodes. The consequences 
of this may be limited by having many replication 
groups that spread primary duties (and thus read 
load) over all of the nodes. In the best case, they will 
uniformly spread their reads over all of the nodes as 
SMARTER does. 


To capture the range of possible read spreading in P- 
B systems we implemented two versions: worst and 
best cases. The worst case version is called PB1 be- 
cause it reads from only one node. It assumes that 
spreading is completely ineffective and sends all 
reads to the primary. The best case is called PBN 
and simulates perfect spreading by sending reads to 
all N nodes. Rather than implementing multiple 
groups, we simply used SMARTER’s existing read 
distribution algorithm, but without the quorum check 
and without the check to avoid sending reads to 
nodes that are checkpointing. 


The latter point is the crucial difference between the 
two systems. While PBN is able to use all of the disk 
arms for reads, it can’t dynamically select which arm 
to use for a particular read because it must send reads 
to the primary, and it achieves spreading only by dis- 
tributing the work of the primaries for many groups. 
Moving a primary is far too heavy-weight to do on 
each read. SMARTER, on the other hand, tries to 
move reads away from checkpointing replicas so that 
writes don’t interfere with reads. It also adds some 
randomness into the decision about when to check- 
point to avoid having replicas checkpoint in lockstep. 
In the mixed read/write transaction processing load 
measured in section 4.4 Gaios achieves 12% better 
performance than PBN because of this ability (and is 
68% faster than PB1). 


4.3 Microbenchmarks 

We ran microbenchmarks on Gaios and P-B replica- 
tion as well as directly on an instance of each of the 
two types of disks used in our servers, varying the 
number of servers from 1 to 5. We expect that most 
applications would want to run with a group size of 3, 
though a requirement for greater fault tolerance or 
improved read performance argues for more replicas. 
In all of the experiments where we varied the degree 
of replication, we used the three old servers first fol- 
lowed by the two new servers, so for instance the 4 
replica data point has three old and one new server. 


We used the sqlio [33] tool running on NTFS over 
the Gaios disk driver (or directly on the local drive, 
as appropriate). Gaios exported a 20 GB drive to 
NTFS and sqlio used a 10GB file. Gaios used two 
identical drives on each replica, one for log and one 
for the data store. Each data point is the mean of 10 
measurements and was taken over a five minute peri- 
od, other than the burst writes shown in Figure 4, 
which ran for 10 seconds. We ran all tests with the 
disks set to write through their cache, so all writes are 
durable. We ran the P-B variants only on two or 
more nodes because they’re identical to Gaios on one 
node, and we ran only one P-B variant on the write 
tests, since PB1 and PBN differ only for reads. 

Figure 3: Random IO Performance 


Figure 3 shows the performance of 8 kilobyte random 
reads and writes. In this and the other microbench- 
mark figures, we show the results for the new server 
disks at the 4 replica position both to provide visual 
separation from the old replica disks and to help point 
out that at 4 replicas we started adding new servers to 
the mix. 

The writes were measured with a dirty cache. Write 
performance does not vary much with degree of rep- 
lication or Gaios vs. P-B and is roughly 500 IO/s, a 
little more than twice the local disk’s. This is be- 
cause the server is able to reorder the writes in a disk- 
efficient manner over its 512MB of write buffer 
without the possibility of loss because the data is 
already logged, while the raw disks can reorder only 
over the simultaneously outstanding operations. The 
overhead of replication and checkpoints is negligible 
compared to disk latency, and performance is in- 
creased by SMARTER’s batching. 


A simple back-of-the-envelope computation shows 
how fast we expect the disk to be able to retire ran- 
dom writes, and demonstrates that SMARTER 
achieves that bound, meaning that (at least for ran- 
dom writes) the bottleneck is at the disk, not else- 
where. The disks we used have tracks about ¾ of a 
megabyte in size, so the 10GB sqlio file was around 
14K tracks. SMARTER is using 512MB of cache, 
which is 64K 8KB-sized individual writes, or about 
4.7 writes/track. The 7200 RPM disk takes 8.3ms for 
a complete rotation. 4.7 writes per each 8.3ms rota- 
tion is about 570 writes/s, which is just a little more 
than Gaios’ performance. 
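
The back-of-the-envelope estimate above can be reproduced in a few lines. The sketch assumes a track size of 0.75 MB (which is what a 10GB file spanning roughly 14K tracks implies) and binary units; the function and parameter names are illustrative.

```python
KB, MB, GB = 2**10, 2**20, 2**30


def expected_random_write_rate(file_bytes, track_bytes, cache_bytes,
                               write_bytes, rpm):
    """Estimate writes/s if every buffered write to a track can be
    retired during a single rotation over that track."""
    tracks = file_bytes / track_bytes            # tracks covered by the file
    buffered_writes = cache_bytes / write_bytes  # writes held in dirty cache
    writes_per_track = buffered_writes / tracks
    rotation_seconds = 60.0 / rpm                # about 8.3 ms at 7200 RPM
    return writes_per_track / rotation_seconds


rate = expected_random_write_rate(
    file_bytes=10 * GB,
    track_bytes=0.75 * MB,   # assumed track size
    cache_bytes=512 * MB,
    write_bytes=8 * KB,
    rpm=7200,
)
```

With unrounded inputs this gives 4.8 writes per track and about 576 writes/s, in line with the rounded figures of 4.7 writes/track and 570 writes/s quoted in the text.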


The random read test used 35 simultaneous outstand- 
ing reads. Gaios’ and PBN’s random reads (also 
shown in Figure 3) scale slightly sub-linearly with 
the number of replicas. They improve with the num- 
ber of replicas because SMARTER is able to employ 
the disk arms on the replicas separately, but the im- 
provement is less than linear because as it scales each 
replica has fewer simultaneous reads over which to 
reorder. Single replica Gaios has a read rate about 
14% lower than the local disk. PB1 didn’t vary with the 
count of replicas since it only reads from one node. 

Figure 4: Burst Write Performance 


Figure 4 shows the write rates for 10 second bursts of 
8K random writes with 200 writes outstanding at a 
time. In this test, Gaios and PB logged and executed 
the writes and returned the replies to the client, but 
because the volume of data written was smaller than 
the 512MB dirty cache limit, it was bounded only by 
logging not by the seek rate of the data disk. Because 
SMARTER answers writes when they’re written to 
the log, it does random write bursts at the rate of se- 
quential writes, while the local disk does them at the 

Figure 5: Sequential Bandwidth 


Figure 5 shows Gaios’ performance for sequential 
IO. This test used megabyte size requests with 40 
simultaneously outstanding for writes and 10 eight 
megabyte requests for reads. It’s difficult to see on 
the graph, but the (old) local disk writes at about 88 
MB/s, while Gaios is at 67 MB/s. The difference is 
due to a difficulty in getting the data through the 
network transport. Writes for both Gaios and PB 
slow down marginally as they’re distributed across 
more nodes (and as they need to write the slower new 
disks at 4 and 5 replicas). PBN and Gaios’ reads are 
more interesting: unlike random IO, sequential IO is 
harder to parallelize because distributing sequential 
IO requests adds seeks, which reduces efficiency, 
sometimes more than the increase in bandwidth that’s 
achieved by adding extra hardware. This shows up in 
the PBN and Gaios lines, which perform at the local 
disk rate on a single replica, peak at 2 replicas (but at 
only 1.3 times the rate of a local disk) and drop off 
roughly linearly after. SMARTER probably would 
benefit from getting hints from the RSM about how 
to distribute reads. 


Figure 6 shows the operation latency for 8K reads 
and writes. Unlike the other microbenchmarks, this 
test only allowed a single operation to be outstanding 
at a time. For reads, Gaios is about 8% slower than a 
local disk in the single replica case and 20% slower 
for 2-3 replicas. The difference in going from one to 
two replicas is that there is extra network traffic in 
the server to execute the read-only algorithm (see 
Section 3.3.2). Both versions of PB are about 2% 
faster than Gaios at 2 nodes, and 10-15% faster at 5 
(where Gaios has to touch three nodes for its quorum 
check). 

Figure 6: Single Operation Latency 


Write latency is more interesting. In Gaios and P-B, 
the main contributor to latency is writing into the log, 
because the write rate is slow enough that the system 
doesn’t throttle behind the replica checkpoint even 
for a 5 minute run. Writing one item to the log, wait- 
ing a little while and then writing again causes the log 
disk to have to take an entire 8.3ms rotation before 
being able to write the next log record, which ac- 
counts for the bulk of the time in Gaios. Latency 
goes down at three replicas because only 2 of three of 
them need to complete their log write for the opera- 
tion to complete. As the replication grows PB gets 
slower than Gaios because of its requirement that the 
primary always be in every quorum. 
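
The effect of the quorum rule on write latency can be sketched as follows. The model is deliberately simplified: it considers only per-replica log write times, ignores network delay, and uses hypothetical function names. Gaios completes when any majority has logged; the simulated P-B variant additionally requires the leader's log write (modification 2 in Section 4.2).

```python
def gaios_commit_latency(log_latencies_ms):
    """Time at which a majority of replicas have finished their log write."""
    majority = len(log_latencies_ms) // 2 + 1
    return sorted(log_latencies_ms)[majority - 1]


def pb_commit_latency(log_latencies_ms, leader=0):
    """Same majority rule, but the quorum must include the leader."""
    return max(gaios_commit_latency(log_latencies_ms),
               log_latencies_ms[leader])


# Hypothetical log write times (ms) for a 3-replica group; index 0 is the
# leader, which happens to be the slowest on this operation.
latencies = [9.0, 4.0, 5.0]
```

With these numbers the any-majority rule completes at 5.0 ms, while the leader-in-quorum rule has to wait the full 9.0 ms for the leader.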


The reason for storing data in an RSM is to achieve 
fault tolerance. To measure how Gaios performs 
when a fault occurs we ran a 60 second version of the 
3 replica sequential read test and induced the failure 
of a replica half way through each of the runs. The 
resultant bandwidth was 127 MB/s, roughly equiva- 
lent to the 128MB/s of the non-faulty three node 
case. However, the maximum operation latency in- 
creased from 1500ms to 1960ms, because requests 
outstanding at the time of the failure had to time out 
and be retried. The large max latency in the non- 
failure case was due to the disk scheduling algorithm 
starving one request for a while and because of queu- 
ing delay (which is substantial with 10 8MB reads 
simultaneously outstanding). 


4.4 Transaction Processing 

In order to observe Gaios in a more realistic setting 
(and with a mixed read/write load), we ran an indus- 
try standard online transaction processing (OLTP) 
benchmark that simulates an order-entry load. We 
selected the parameters of the benchmark and config- 
ured the database so that it has about a 3GB log file 
and a 53GB table file. We housed the log and tables 
on different disks. In Gaios (and P-B) we ran each 
virtual disk as a separate instance of Gaios sharing 
server nodes, but using distinct data disks on the 
server. SMARTER shared a single log disk, so each 
server node used three disks: the SMARTER log, the 
SQL log and the SQL tables. 


This benchmark does a large number of small trans- 
actions of several different types, and generates a 
load of about 51% reads and 49% writes to the table 
file by operation count, with the average read size 
about 9K and the average write about 10K. We con- 
figured the benchmark to offer enough load that it 
was IO bound. The CPU load on the client machine 
running SQL Server was negligible. 


We used 64-bit Microsoft SQL Server 2008 Enter- 
prise Edition for the database engine. For each data 
point, we started by restoring the database from a 
backup, which resulted in identical in-file layout. We 
then ran the benchmark for three hours, discarded the 
result from the first hour in order to avoid ramp-up 
effects and used the transaction rate for the second 
two hours. This benchmark is sensitive to two 
things: write latency to the SQL Server log, and read 
latency to the table file. The writes are offered nearly 
continuously as SQL Server writes out its check- 
points and are mixed with the reads. 


Even though the load is half writes, the replicas spent 
significantly less than half of their time writing. This 
is because the writes were more sequential than the 
reads because they came from SQL’s database clean- 
er which tries to generate sequential writes, and they 
were further grouped by SMARTER’s checkpoint 
mechanism. Because of this, Gaios usually had one 
or more replicas that were not in checkpoint to which 
to send reads. Even though the load at the client was 
about half reads and half writes, at the server nodes it 
was ¾ writes because each write ran on all three 
nodes, while reads ran only on one. This limited the 
effect of the increased random read performance of 
Gaios and PBN. 
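
The shift in the read/write mix between client and server follows from simple arithmetic, sketched below. The function name is illustrative; the only assumptions are the ones stated in the text, namely that each write runs on every replica and each read runs on exactly one.

```python
def server_write_fraction(client_write_fraction, replicas):
    """Fraction of server-side operations that are writes, when writes
    execute on all replicas and reads execute on a single replica."""
    writes = client_write_fraction * replicas
    reads = 1.0 - client_write_fraction
    return writes / (writes + reads)


three_node_mix = server_write_fraction(0.5, 3)
```

For an even client mix on three nodes this gives 1.5 / (1.5 + 0.5) = 0.75, i.e. three quarters of the operations seen by the servers are writes.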

Figure 7 shows the performance of Gaios and the two 
PB versions running on a three node system in trans- 
actions per second normalized to the local-machine 
performance. Each bar is the mean of ten runs. Gai- 
os runs a little faster than the local node because its 
increased random read performance more than com- 
pensates for the added network latency and checksum 
IO. Because PBN is unable to direct its reads away 
from checkpointing nodes it is somewhat slower, 
while PB1 suffers even more due to its inability to 
extract read parallelism. 

Figure 7: OLTP Performance 


5. Related Work 


Google [8] used a Paxos replicated state machine to 
re-implement the Chubby [5] lock service. They 
found that it provided adequate performance for their 
load of small updates to a state that was small enough 
to fit in memory (1OO0MB). It serviced all reads from 
the leader (there being no need to take advantage of 
parallel disk access because of in-memory state), and 
used a time-based leasing protocol to prevent stale 
reads, similar to primary-backup. Their highest re- 
ported update rate was 640 small operations per se- 
cond and 949 KB/s on a five node configuration, 
about one fifth and one sixtieth respectively of Gaios’ 
comparable performance on 5 nodes, though because 
the hardware used was different it’s not clear how 
meaningful this comparison is. 


Petal [24] was a distributed disk system from DEC 
SRC that used two-copy primary-backup replication 
to implement reliability. It used a Paxos-based RSM 
to determine group membership, but not for data. 
Data writes happened in two phases, first taking a 
lock on the data and then writing to both copies. On- 
ly when the writes to both copies completed was the 
lock released and the operation completed to the user. 
Much like Gaios, Petal used write-ahead logging and 
group commit to achieve good random write perfor- 
mance. Castro and Liskov [7] implemented a version 


NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation 


of NFS that stored all of its data in a BFT replicated 
state machine. However, their only performance 
evaluation was with the Andrew Benchmark [16], 
which has been shown [38] to be largely insensitive 
to underlying file system performance. BFT replica- 
tion differs from Paxos in that it tolerates arbitrary, 
potentially malicious failures of less than a third of its 
replicas. It uses many more messages and a number 
of cryptographic operations to achieve this property. 


Several BFT agreement protocols [1, 9, 18] have 
much lower latency than Gaios. They achieve this by 
not logging operations before executing them and 
returning results to the client. Because of this, these 
systems cannot tolerate simultaneous crashes of too 
many nodes (such as would be caused by a datacenter 
power failure) without permanently failing or rolling 
back state. As such, they do not provide sufficiently 
tight semantics to implement tasks that require write 
through such as the store for a traditional database. 
They also are not evaluated on state that is larger than 
memory. Furthermore, because they tolerate general 
Byzantine faults, they need at least 3f+1 (and some- 
times more) replicas to tolerate f faults (though f of 
these replicas can be witnesses that do not hold exe- 
cution state [40]). Gaios tolerates many non- 
malicious (hardware or programming-error caused) 
Byzantine faults without the extra complexity of 
dealing with peers that are trying to corrupt the sys- 
tem. 


The Federated Array of Bricks (FAB) [34] built a 
store out of a set of industry-standard computers and 
disks, much like Gaios. It used a pair of custom rep- 
lication algorithms, one for mirrored data and one for 
erasure-coded. Unlike Paxos, it did not have a leader 
function or views; rather (in the mirroring case), it 
took a write lock over a range of bytes using a major- 
ity algorithm. Once the write lock was taken, it sent 
the write data to all nodes, and updated both the data 
and a timestamp. After a majority of the nodes com- 
pleted the write, it completed the operation back to 
the caller. To read data, it sent the read to all repli- 
cas, with one designated to return the data. The other 
nodes returned only timestamps; if the returned data 
did not have the latest timestamp, it retried the read. 
This scheme achieves serializability without needing 
to achieve a total order of operations as happens in an 
RSM. However, because its read algorithm requires 
accessing a per-block timestamp, it employed
NVRAM to avoid the need to move the disk arms to 
read the timestamps; SMARTER’s algorithm simply 
asks for a copy of in-memory state from all of the 
replicas, and does the disk IO on only one and so 
does not need NVRAM.

OceanStore [19] was designed to store the entire
world’s data. It modified objects by generating
updates locally and then running conflict resolution in
the background, in the style of Bayou [11].
OceanStore used a Byzantine-agreement protocol to
serialize and run conflict resolution, but stored the 
data using simple lazy replication (or replication of 
erasure coded data). 


The Google File System [13] is designed to hold very 
large files that are mostly written via appends and 
accessed sequentially via reads. It relaxes traditional 
file system consistency guarantees in order to
improve performance. In particular, write operations
that fail because of system problems can leave files in 
an “inconsistent” state, meaning that the values re- 
turned by reads depend on which replica services the 
read. Furthermore, concurrent writes can leave file 
regions in an “undefined” state, where the result is 
not consistent with any serialization of the writes, but 
rather is a mixture of parts of different writes. After 
a period of time, the system will correct these prob- 
lems. GFS uses write-to-all, so faults require the 
system to reconfigure before writes can proceed. 


Berkeley’s xFS [2] and Zebra [15] file systems 
placed a log structured file system [32] on top of a 
network RAID. They worked by doing write-to-all 
on the RAID stripes, and then using a manager to 
configure out failed storage nodes. The xFS proto- 
type described in the paper did not “implement the 
consensus algorithm needed to dynamically reconfig- 
ure manager maps and stripe group maps.” 


Boxwood [27] offered a set of storage primitives at a 
higher level than the traditional array of blocks, such 
as B-trees. It used Paxos only to “store global system 
state such as the number of machines.” 


Everest [29] is a system that offloads work from busy 
disks to smooth out peak loads. When off-loading, it 
writes multiple copies of data to any stores it can find 
and keeps track of where they are in volatile memory. 
After a crash and restart, the client scans all of the 
stores to find the most up-to-date writes, and as long 
as one copy of each write is available, it recovers. 
This protocol works because there is only ever one 
client for a particular set of data. 


TickerTAIP [6] was a parallel RAID system that dis- 
tributed the function of the RAID controller in order 
to tolerate faults in the controller. It used two-phase 
commit [14] to ensure atomicity of updates to the 
RAID stripes. 


6. Summary and Conclusion 
Conventional wisdom holds that while Paxos has 
theoretically desirable consistency properties, it is too 
expensive to use for applications that require perfor- 
mance. We argue that compared to disk access laten- 
cies, the overhead required by Paxos on local net- 
works is trivial and so the conventional wisdom is 
incorrect. While replicated state machines’ in-order 
requirement seems to be at odds with the necessity of 
doing disk operation scheduling, careful engineering 
can preserve both. 


We presented Gaios, a system that provides a virtual 
disk implemented as a Paxos RSM. Gaios achieves 
performance comparable to the limits of the hardware 
on which it’s implemented on various microbench- 
marks and the OLTP load, while providing tolerance 
of arbitrary machine restarts, a sufficiently small set 
of permanent stopping failures and some types of 
Byzantine failures. We compared Gaios to primary- 
backup replication and found that it performs
comparably to or in some cases better than P-B’s best case.
We presented a novel read-only algorithm for 
SMARTER, and showed that because it allows reads 
to run on any node SMARTER can often avoid hav- 
ing reads and writes contend for a particular disk, 
giving significant performance improvements over 
even the best case of primary-backup replication for 
the mixed read/write workload of the OLTP
benchmark.

Abstract


Lack of accountability makes the Internet vulnerable 
to numerous attacks, including prefix hijacking, route 
forgery, source address spoofing, and DoS flooding at- 
tacks. This paper aims to bring accountability to the In- 
ternet with low-cost and deployable enhancements. We 
present IPA, a design that uses the readily available top- 
level DNSSEC infrastructure and BGP to bootstrap ac- 
countability. We show how IPA enables a suite of secu- 
rity modules that can combat various network-layer at- 
tacks. Our evaluation shows that IPA introduces modest 
overhead and is gradually deployable. We also discuss 
how the design incentivizes early adoption. 


1 Introduction 


Accountability, the ability to identify misbehaving en- 
tities and deter them from misbehaving further, plays a 
critical role in achieving real-world security [41]. How- 
ever, the Internet design has little built-in accountability: 
malicious hosts can send denial of service (DoS) flooding 
packets with spoofed source addresses to evade punish- 
ment; and malicious Autonomous Systems (ASes) can 
announce other ASes’ IP prefixes or assume their identi- 
ties in the inter-domain routing system BGP. 

Lack of accountability has led to many of the In- 
ternet’s security vulnerabilities [20, 58], including dis- 
tributed DoS attacks that may disable a country’s Inter- 
net access [48, 49, 52], and prefix hijacking attacks that 
once made YouTube worldwide unreachable [25]. In 
this work, we ask the question: can we overcome the 
Internet’s main security weaknesses with a minimal set 
of gradually deployable changes? That is, we aim to
explore an approach that can fix the Internet’s security 
problems without replacing or breaking the deployed In- 
ternet base. We are attracted to this approach because 
of its practical value, as it can deliver benefits without 
building everything from scratch. 

In this paper, we present a design called IPA (IP made 
Accountable) that bootstraps accountability in the In- 
ternet with only low-cost and gradually deployable en- 
hancements. We show how the IPA design enables other 
security modules that together fix many of the Internet’s 
security problems, including preventing prefix hijacking, 
route forgery, and source address spoofing attacks, and 
limiting large-scale DoS attacks. We note that this work 
does not aim to provide all forms of accountability. For 
instance, IPA does not provide the type of strong
accountability that offers evidence of correct execution, or
audit and challenge interfaces [32, 60]. Rather, it aims 
to bring a similar form of network-layer accountability 
as defined in [20, 54] to the Internet, i.e., the ability to 
accurately identify the sources of all traffic and defend 
against malicious sources. 

We identify two key challenges in bootstrapping ac- 
countability in the existing Internet. The first one is 
how to securely bind an entity’s identity to its crypto- 
graphic keys in a lightweight manner, and the second one 
is how to do so in an adoptable manner, including being 
gradually deployable and incentivizing early adoption. 
Network-layer accountability requires a secure binding 
between an entity’s identity and its cryptographic keys 
to prevent impersonation and identity white-washing at- 
tacks [31]. The Internet uses two types of identifiers, IP 
addresses and AS numbers (ASNs), to identify network
attachment points and ASes, but it lacks a lightweight 
and adoptable mechanism to create the secure bindings 
between an IP address (or an ASN) and a network entity. 
Previous work [38, 46, 56, 57] proposes to use a cen- 
tralized global public key infrastructure (PKI) or web-of- 
trust to bind an IP prefix or an ASN to its owner’s public 
key. However, a dedicated PKI is too heavyweight [35], 
and web-of-trust lacks an authoritative trust chain to re- 
solve conflicting IP prefix or ASN claims. 

IPA uses three mechanisms to address these chal- 
lenges. First, it uses the top-level reverse DNSSEC hi- 
erarchy as a lightweight PKI to bind an IP prefix to its 
owner’s public key (§ 3.2), and the hash of an AS’s 
public key as its self-certifying ASN (§ 3.1). This
design securely certifies an IP prefix’s ownership without
a separate PKI, and obviates another PKI to certify an
ASN’s ownership. We use DNSSEC [21, 22, 23, 50]
because one can create a one-to-one mapping between 
an IP prefix delegation and a reverse DNS zone delega- 
tion, as the chains of trust in both delegation processes 
share the same root: the Internet Assigned Number Au- 
thority (IANA). Thus, we can use an IP prefix’s corre- 
sponding reverse DNSSEC record as its owner’s IP pre- 
fix delegation certificate. Moreover, Internet registries 
are rapidly deploying the top-level reverse DNSSEC in- 
frastructure [4, 6, 18, 19]. The root, the arpa, and 
the in-addr.arpa zones are already signed. Deploy- 
ment documents from key Regional Internet Registries 
(RIRs) [1, 2, 5] all suggest that the top-level reverse 
DNSSEC infrastructure would soon be fully deployed.

Second, IPA uses an efficient in-band protocol
piggybacked in BGP messages to “push” the IP prefix
certificates to all ASes to secure routing (§ 3.3). This design
avoids the dependency loop between secure routing and 
online certificate distribution, and eliminates the need 
for a separate out-of-band certificate distribution mecha- 
nism. We strive to make the in-band distribution protocol 
efficient and capable of supporting complex operations 
such as certificate revocations and key rollovers (§ 4).

Third, we design IPA to be compliant with the exist- 
ing protocols to be gradually adoptable. It uses the BGP 
optional and transitive attributes to carry IPA-specific
information so that legacy ASes can pass this information
to deployed ASes without interpreting it (§ 7.3.1).
Different ASes can deploy IPA at different times with- 
out a “flag day.” Furthermore, because we use the top- 
level reverse DNSSEC hierarchy to bind IP prefixes to 
their owners’ public keys, the ASes who obtain their IP 
prefixes from the Internet registries can obtain their pre- 
fix ownership certificates from the registries without de- 
pending on other infrastructures. This feature enables 
those ASes, which amount to 78% of all ASes on today’s 
Internet (§ 7.3.2), to form a deployed “club” to prevent 
various network-layer attacks within the club (§ 5).

We further show how IPA enables several security 
building blocks, including a secure routing protocol such 
as S-BGP [38], a source authentication system [43], and 
a DoS defense system [45] (§ 5). These security build- 
ing blocks are also gradually adoptable [26, 43, 45], and 
together can prevent prefix hijacking, route forgery, and 
source address spoofing attacks, and suppress DoS flood- 
ing traffic near its sources. 

We have implemented IPA using XORP [33] and
integrated other security modules with it (§ 6). We evaluate
IPA’s performance and adoptability using trace-driven
experiments (§ 7.2), live Internet experiments (§ 7.3.1),
and analysis (§ 7.3.2). The results suggest that IPA is 
lightweight and gradually deployable in the current Inter- 
net. Our trace-driven experiments show that IPA’s query 
overhead on an Internet registry’s DNS servers is less 
than 0.1% of a single root DNS server’s regular work- 
load. Its in-band certificate distribution protocol intro- 
duces modest overhead to a router. A single-threaded 
IPA implementation running on a commodity PC can 
process all messages a Route Views server [53] receives 
at their arrival rate. We expect that the server’s workload 
is representative of a large ISP’s BGP router’s workload, 
because its number of peers (37) places it among the top
6% of all ASes [8].

Our live Internet experiments show that IPA’s proto- 
col messages piggybacked in BGP can pass standard- 
compliant legacy routers. Our analysis suggests that 
IPA lowers the deployment cost for early adopters
compared to previous work that requires dedicated PKIs [38,
46, 56, 57], but offers equivalent or stronger
security. Thus, it is more likely to be adopted.

To the best of our knowledge, IPA is the first de- 
sign that brings accountability to the Internet in a secure, 
lightweight, and gradually adoptable manner. 


2 System Models and Goals 


Before we present the IPA design, we first describe its 
system models and design goals. 


2.1 System Models 


Network Model: IPA adopts the same two-level hier- 
archical network model (nodes and ASes) as the present 
Internet. For inter-AS routing and forwarding, we treat 
an AS as one trust and fate-sharing unit. AS boundaries 
are also trust boundaries. For clarity, we abstract each 
AS as a node when describing AS-level operations. 


Trust Model: IPA assumes the same external trust enti- 
ties as the present Internet. The global root of trust is the 
Internet Assigned Numbers Authority (IANA). 


Threat Model: We assume that both hosts and routers 
can be compromised. Compromised nodes (hosts or 
routers) can collude into groups and launch arbitrary at- 
tacks. We also assume that an AS may be malicious, and 
malicious ASes can also collude. 


2.2 Design Goals 


IPA’s central design goal is to securely bootstrap
accountability in the Internet with lightweight and
adoptable enhancements. We elaborate on this goal below.


Secure: IPA aims to enable cryptographically provable 
network-layer identities. As we show in § 5, this ability
further enables various security modules that can prevent 
prefix hijacking [34, 38], route forgery [34, 38], source 
address spoofing [43], and DoS flooding attacks [45]. 


Lightweight: We aim to introduce only lightweight en- 
hancements to the Internet. We believe that enhancing 
the existing infrastructures with new functions has lower 
deployment costs than rolling out new global infrastruc- 
tures. For this reason, IPA does not require new global 
infrastructures, unlike [12, 38, 57]; nor does it require 
trusted hardware at end systems (although it can help), 
unlike [20]. Moreover, we aim to add little performance 
overhead to the deployed Internet base. 


Adoptable: We aim to make IPA adoptable, which im- 
plies two sub-goals: 


• Gradually Deployable: We aim to make IPA
compatible with the legacy Internet and ready to be
deployed on the Internet. IPA-enabled ASes (or hosts)
should be able to run IPA-related protocols even if
they are connected by legacy ASes.

• Incentivizing Early Adoption: IPA should require
low deployment costs and provide immediate
security benefits to early adopters to incentivize
deployment. That is, the group of early adopting ASes
should gain security benefits within the deployed
region without requiring other entities outside the
group to deploy IPA.


3 Overview 


This section presents a high-level overview of IPA. We 
present more design details in the following section. IPA 
uses two key mechanisms to be lightweight and gradu- 
ally deployable: 1) it uses the top-level reverse DNSSEC 
infrastructure as a lightweight PKI to bind an IP prefix 
to its owner’s public key; and 2) it uses the BGP routing 
system to distribute IP prefix certificates in-band. 


3.1 A Hybrid Approach to Secure Identifiers 


The present Internet uses two types of identifiers: 1) a 
hierarchically allocated IP address (or prefix) to loosely 
identify a network attachment point (or a group of them 
in the same network), and 2) a flat AS number to identify 
an autonomous system. IANA is the root of trust and the
owner of all IP addresses, i.e., the owner of 0/0. It
delegates sub-prefixes to RIRs, which in turn delegate even
smaller sub-prefixes to ASes. ASes may further
sub-delegate IP prefixes to their customers. Figure 1 shows
an example of the address delegation hierarchy. 

To be gradually deployable, IPA retains the hierarchi- 
cal structure of IP addresses, and uses the existing chain 
of trust in the IP address allocation process to bind an IP 
prefix to its owner’s public key. Since ASNs do not have 
a hierarchical structure, IPA replaces them with ASes’ 
self-certifying identifiers, i.e., the hash of their public 
keys. This design reduces the deployment overhead at 
an Internet registry, as a registry need not bind an AS’s 
identifier to its public key. This new ASN format can be 
gradually deployed in a manner similar to how the 32-bit 
ASN was recently deployed [55]. 
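
The self-certifying identifier can be illustrated with a short sketch. The text only says the identifier is the hash of the AS's public key; the choice of SHA-256, the hex encoding, and the function names below are assumptions made for illustration.

```python
import hashlib


def self_certifying_asn(public_key: bytes) -> str:
    """Derive an AS identifier as the hex digest of the AS's public key.

    SHA-256 is an assumed choice; the text does not name a hash function.
    """
    return hashlib.sha256(public_key).hexdigest()


def key_matches_asn(asn: str, public_key: bytes) -> bool:
    """Check a claimed key against an identifier without consulting a registry."""
    return self_certifying_asn(public_key) == asn
```

Because any party can recompute the hash, no registry has to certify the binding between the identifier and the key, which is the deployment saving the paragraph above describes.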


3.2 DNSSEC as a Lightweight PKI 


The IPA design uses the top-level DNSSEC infrastruc- 
ture as a lightweight PKI for Internet registries to issue 
IP prefix delegation certificates. DNSSEC was originally
designed to protect the integrity of DNS replies. Sim- 
ilar to a PKI, it allows a parent entity to use its key to 
certify a DNS zone delegation to a child entity. Each 
zone owner signs the DNS records in its zone, and pub- 
lishes their signatures in DNS for verification. When a 
client performs a DNSSEC query for a domain name, it 
can verify the authenticity of the answer by following the 
DNS hierarchy to obtain the relevant DNSSEC records. 
Using DNSSEC to certify IP prefix delegation has
several advantages. First, we can create a one-to-one
mapping between a reverse DNS zone delegation and an IP
prefix delegation, as the reverse DNS hierarchy and the
IP address hierarchy share the same root (IANA). For
example, when IANA delegates an IP prefix 165/8 to
an RIR (ARIN), it can also delegate the corresponding
reverse DNS zone, 165.in-addr.arpa, to ARIN (Figure 1).
This delegation further enables ARIN to create
a one-to-one mapping between the IP sub-prefixes and
the reverse DNS zone’s sub-delegations, e.g., delegating
165.72/16 and 72.165.in-addr.arpa to an AS
(AT&T). A prefix owner can use the DNSSEC records
that certify its reverse DNS zone delegation as a
certificate authorizing its prefix ownership (§ 4.1). We refer
to this type of certificate as an IP prefix delegation
certificate or a prefix certificate. This design reduces IPA’s
deployment costs at an Internet registry, as it need not
maintain a separate PKI to certify IP prefix delegations.

Figure 1: Left: the IP prefix allocation hierarchy; Right:
the corresponding DNSSEC records that bind the prefixes
to their owners’ public keys.
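
The one-to-one mapping between prefixes and reverse zones can be sketched for the octet-aligned case, which is the only case the standard reverse DNS naming covers directly. The function name is hypothetical, and prefixes are written in full dotted form (165.0.0.0/8 rather than the shorthand 165/8) because that is what Python's ipaddress module parses.

```python
import ipaddress


def reverse_zone(prefix: str) -> str:
    """Map an octet-aligned IPv4 prefix to its in-addr.arpa zone name."""
    net = ipaddress.ip_network(prefix)
    if net.version != 4 or net.prefixlen == 0 or net.prefixlen % 8 != 0:
        raise ValueError("only octet-aligned IPv4 prefixes are handled here")
    # Keep the octets covered by the prefix length and reverse their order.
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def is_sub_delegation(child_zone: str, parent_zone: str) -> bool:
    """A sub-prefix's zone sits below its parent's zone in the DNS tree."""
    return child_zone.endswith("." + parent_zone)
```

With this mapping, the zone for 165.72.0.0/16 falls under the zone for 165.0.0.0/8, mirroring the prefix delegation from ARIN to the AS.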

The second advantage is that Internet registries are 
rapidly deploying DNSSEC [7, 29, 50]. The root zone 
was signed in July 2010 [19], followed by the arpa and
the in-addr.arpa zones. IANA will further sign the
sub-zone delegations from in-addr.arpa in late March
2011 [7]. Moreover, the three largest RIRs, ARIN, RIPE,
and APNIC, have all stated on their websites that they are
ready to or will soon be ready to sign reverse zone
sub-delegations [1, 2, 5]. Since these RIRs own 142 out of
175 sub-zones of in-addr.arpa [14], we expect that the
top-level reverse DNSSEC will soon be fully deployed 
by all Internet registries. 

Finally, because DNSSEC supports online queries, an 
Internet registry can use it to publish new IP prefix
certificates to support key rollovers (§ 4.5) or revocations
(§ 4.2), in addition to issuing certificates. An AS can
query the DNS to download its up-to-date prefix certifi- 
cates and the Internet registries’ revocation lists. 


3.2.1 IP Prefix Sub-delegation 


After an AS obtains its IP prefixes, it may delegate
sub-prefixes to its customers. For instance, Sprint in
Figure 1 allocates a sub-prefix 106.12.208/20 to its
customer Surewest. The IPA design allows an AS to
flexibly choose the infrastructure it uses to manage these
sub-delegation certificates. An AS can choose to use 
DNSSEC, as does an Internet registry. Alternatively, it 
may use a certificate authority server to issue the IP
prefix certificates. In the latter case, an AS should also
support a certificate publishing mechanism (e.g., a secure
web server or an FTP server) to enable its customers to
download their up-to-date certificates online. This
requirement is to support automatic key rollovers (§ 4.5).
We believe that an AS has incentives to manage and pub- 
lish its customers’ certificates, because this effort can 
protect its customers from prefix hijacking attacks. 

For clarity, in the IP prefix delegation process, we refer 
to the delegator as the parent owner, and the delegatee as 
the child owner. 


3.3 In-band Certificate Distribution


To prevent routing attacks, ASes must use a secure
routing protocol (e.g., S-BGP [38], § 5.1) to validate prefix
origins and AS paths in BGP messages. This requires 
ASes to first obtain valid IP prefix certificates. 

IPA uses BGP itself to distribute these certificates in- 
band to ASes that need them. That is, when an AS orig- 
inates an IP prefix in a BGP message, it piggybacks the 
chain of certificates that can prove its prefix ownership 
in the message. We use a BGP feature, the transitive and 
optional path attribute, to carry the certificates. An AS 
can first obtain the chain of certificates offline when it 
obtains the IP prefix from its parent AS or an Internet reg- 
istry. Later, it can periodically download the full chain of 
the latest certificates, as we will describe in § 4.5.

This design has several advantages. First, it avoids the 
dependency loop between secure routing and online cer- 
tificate distribution. If we use an alternative approach 
where each AS downloads the prefix certificates from 
online distribution servers (e.g., DNSSEC servers), a
dependency loop between routing and certificate
distribution may occur. This is because to obtain a prefix p’s
certificate $C_p$, an AS X must first establish a valid path to
an AS Y that hosts $C_p$’s distribution server. Recursively,
to establish a valid path to AS Y, X must validate the
BGP messages advertising AS Y’s prefixes, which
requires AS X to have obtained AS Y’s prefix certificates.
These certificates may be served by a distribution server
in yet another AS Z, and to establish a valid path to Z, X
needs the certificates for Z’s prefixes, and so on. These 
dependencies may eventually form a loop, preventing AS 
X from obtaining the certificates needed to validate the 
prefix p’s ownership. 
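
The dependency loop described above can be modeled as a chain of lookups. The sketch below is a simplified model, not part of IPA: it assumes each AS's certificates are served from a single hosting AS (the hypothetical `hosted_in` map) and asks whether the chain of fetches ever reaches an AS whose certificates the requester already holds.

```python
def can_fetch(start: str, hosted_in: dict[str, str], already_have: set[str]) -> bool:
    """Return True if the chain of out-of-band certificate fetches terminates.

    start        -- the AS whose prefix certificates are wanted
    hosted_in    -- maps an AS to the AS hosting its certificate server
    already_have -- ASes whose certificates the requester already holds
    """
    seen = set()
    current = start
    while current not in already_have:
        if current in seen:
            return False  # dependency loop: a fetch depends on itself
        seen.add(current)
        host = hosted_in.get(current)
        if host is None:
            return False  # no known distribution server
        # Reaching the host requires a validated route to it, which in turn
        # requires the host's own certificates.
        current = host
    return True
```

In this model the loop is broken only when some certificate on the chain is already held, which is what in-band distribution provides by delivering certificates hop by hop from a neighbor.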

In contrast, in-band distribution does not introduce 
such dependencies. This is because it does not require an 
AS to establish an a priori valid path to an online distri- 
bution server. BGP messages are propagated hop-by-hop 
(at the AS level). An AS will first obtain valid certificates 
from its neighbors, and then from its neighbors’ neigh- 
bors, and so on, until it obtains the valid certificates from 
all ASes in the routing system.

Second, in-band distribution lowers deployment costs,
as it does not need an out-of-band channel to distribute 
the certificates, unlike [38, 56]. IPA also uses standard 
BGP features to encode the certificates so that different 
ASes may gradually adopt the distribution mechanism 
without breaking BGP. 

Finally, including a prefix p’s full chain of certificates 
ensures that any AS that receives a BGP message origi- 
nating p can immediately validate p’s owner’s public key. 
This further ensures that an AS can promptly validate the 
prefix origin and AS path in the BGP message (§ 5.1) and
propagate the message and the chain of certificates fur- 
ther to its neighbors. These neighbors can in turn use the 
certificates to validate the BGP message and propagate 
it further, until all ASes have received and validated the 
BGP message. We refer to this property as liveness, and 
provide a formal proof of it in [42]. We discuss how to 
validate a certificate in § 4.4. 

Attaching a full chain of certificates in a BGP message
incurs significant communication overhead. IPA uses a 
simple but effective technique to reduce this overhead: 
each AS caches the certificates that it has sent to a neigh- 
bor and only sends to the neighbor the certificates that it 
has not sent yet. We describe it in more detail in § 4.3. 


4 Design Details 


This section presents more design details of IPA,
including how to use DNSSEC records to encode an IP
prefix certificate (§ 4.1), certificate revocation (§ 4.2),
efficient certificate distribution (§ 4.3), certificate validation
(§ 4.4), and key management (§ 4.5).


4.1 DNSSEC Records as IP Prefix Certificates 


IPA uses three types of a reverse DNS name’s resource 
records to encode a prefix certificate: the delegation
signer (DS) record, the public key (DNSKEY) record,
and the signature (RRSIG) record of the DS record. 

Figure 2 shows the DNSSEC records that form the cer- 
tificate for the prefix 165/8, which IANA allocates to 
ARIN (Figure 1). These records are associated with the 
DNSSEC entry 165.in-addr.arpa created by IANA. 
IANA uses the DS record to store the hash of ARIN’s 
public key, and signs the DS record using its private
key. It sets the inception and expiration times of the sig- 
nature record (RRSIG) to the inception and expiration 
times of the prefix allocation, and publishes the entry 
165.in-addr.arpa on its DNS servers. This process 
follows the standard DNSSEC practice, and also applies 
to IPv6 address allocation. 

A slight complication arises as not all IP address 
allocations fall on a reverse DNS domain boundary. 
For instance, as shown in Figure 1, ARIN may allo- 
cate an IP prefix 106.12/14 to Sprint. We address 
this issue by extending the encoding format of a
reverse DNS name. For instance, we use the reverse
DNS name 12/14.106.in-addr.arpa to encode the
IP prefix 106.12/14. The encoding/decoding rules
are straightforward and compatible with the DNS
standard [47]. We omit them due to the lack of space, but
describe them in [42]. We choose not to use the
existing techniques that support classless reverse zone
delegations [27, 30], because they either only support
allocations in chunks smaller than a /24 prefix [30], or are no
longer supported by popular DNS servers [9, 27].

165.in-addr.arpa  DNSKEY  K_ARIN        (290 bytes)
165.in-addr.arpa  DS      Hash(K_ARIN)  (50 bytes)
165.in-addr.arpa  RRSIG   DS            (312 bytes)

Figure 2: This figure shows the DNSSEC records that
encode the prefix 165/8’s certificate. The size of each record is
estimated assuming that the signatures are generated using
2048-bit RSA/SHA-1.


4.2 Revoking an IP Prefix Certificate 


An Internet registry or an AS may revoke a certificate 
allocated to a child before it expires. This may occur if 
the prefix is re-assigned to a new child owner, or the child 
owner’s key is compromised, or the child owner violates 
the terms of use or switches to a different ISP. 

In the IPA design, a parent owner issues a new pre- 
fix certificate to explicitly revoke the old one. The new 
certificate binds the IP prefix to a new public key with a 
newer inception time. The new key could be a new child 
owner’s key, or the present child owner’s new key, or the 
parent’s own key if it reclaims the IP prefix from a child. 

As we discuss in § 3.3, IPA distributes IP prefix certifi- 
cates in the routing system for ASes to validate routing 
messages. To use a certificate to validate a routing mes- 
sage, an AS must know whether the certificate has been 
revoked or not. IPA uses both push and pull mechanisms 
to notify an AS of a certificate’s revocation status. 


Pushing New Certificates via Routing: Because a new 
certificate explicitly revokes an old one, a new certifi- 
cate’s owner can immediately announce the new certifi- 
cate in BGP using the in-band distribution mechanism to 
notify other ASes of the old certificate’s revocation. 


Periodic Pulling From Internet Registries: When
an Internet registry revokes a prefix certificate, the reg- 
istry may be unable to notify other ASes using the 
push-based mechanism, because it does not participate 
in routing. We use a DNSSEC-based revocation list 
to address this problem. A revocation list includes the 
set of IP prefixes an Internet registry reclaims from 
its children, or re-assigns to its children that are also
Internet registries. The registry can publish the list
using a TXT record with a special DNS name, e.g.,
revoked.arin.in-addr.arpa, and sign the list using
DNSSEC. An entry in a revocation list includes the re- 
voked IP prefix and the revocation time. It revokes any 
older prefix certificate signed by the same registry and 
whose address range overlaps with the revoked prefix. 

Each AS periodically (e.g., daily) downloads the re- 
vocation lists from all Internet registries to invalidate re- 
voked certificates (§ 4.4). An AS does not query DNS at 
the certificate validation time to reduce DNS load. Peri- 
odic downloads may delay a certificate’s revocation, but 
we consider this delay acceptable, as it will not lead to 
prefix hijacking attacks. Only the IP prefixes not allo- 
cated to any AS will suffer this delay, as an AS that owns 
an IP prefix can immediately announce its new certificate 
in BGP to revoke the old one. 
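
The two revocation paths can be sketched as two checks. The record layouts and names below are hypothetical. One interpretation is assumed and labeled: a certificate counts as "older" than a revocation entry when its inception time precedes the entry's revocation time.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class PrefixCert:
    prefix: str     # e.g. "165.72.0.0/16"
    signer: str     # registry or parent owner that signed the certificate
    inception: int  # signature inception time, seconds since the epoch


@dataclass(frozen=True)
class RevocationEntry:
    prefix: str
    registry: str
    revoked_at: int


def revoked_by_newer(cert: PrefixCert, other: PrefixCert) -> bool:
    """Push path: a newer certificate for the same prefix revokes the old one."""
    return (other.prefix == cert.prefix
            and other.signer == cert.signer
            and other.inception > cert.inception)


def revoked_by_list(cert: PrefixCert, entries: list[RevocationEntry]) -> bool:
    """Pull path: match against a registry's periodically downloaded list."""
    net = ipaddress.ip_network(cert.prefix)
    return any(
        e.registry == cert.signer
        and cert.inception < e.revoked_at  # assumed meaning of "older"
        and net.overlaps(ipaddress.ip_network(e.prefix))
        for e in entries
    )
```

An AS would apply both checks when validating a certificate: the first against certificates it has received in BGP, the second against the most recent lists it has downloaded.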


4.3 Efficient Certificate Distribution 


As we describe in § 3.3, IPA uses a BGP message itself to 
distribute the full chain of certificates of the IP prefix that 
the message advertises. We now describe how to make 
this in-band distribution protocol efficient. 

Each AS maintains several certificate caches to record 
what it has sent to a neighbor and to maintain certificate 
validation state, as shown in Figure 3. The caches in- 
clude: 1) an incoming certificate cache that stores all cer- 
tificates received from its neighbors; 2) a trusted certifi- 
cate cache that stores the certificates it has validated; and 
3) a per-neighbor outgoing certificate cache that records 
the hash of each certificate it has sent to the neighbor. 
An AS organizes the certificates in its trusted cache in a 
tree-like structure following the IP allocation hierarchy 
to assist certificate validation (§ 4.4). 

When an AS receives a prefix certificate from a neigh- 
bor, it first stores the certificate in its incoming cache, and 
then validates the certificate as we describe next. When 
the AS sends a BGP message to a neighbor announcing 
the IP prefix, it will retrieve the full chain of certificates 
from its trusted certificate cache, and compare them with 
those in the neighbor’s outgoing certificate cache. It will 
only send the certificates that are not in the neighbor’s 
outgoing cache, and then insert them in the outgoing 
cache to avoid sending them to the neighbor again. 

When an AS loses the peering connection to a neigh- 
bor, e.g., due to a router reboot or link failure, it will re- 
move all entries in the neighbor’s outgoing cache. When 
the AS resumes its connection with the neighbor, it will 
re-send the full chain of certificates for each prefix it an- 
nounces to the neighbor. 
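
The per-neighbor outgoing cache described above can be sketched as follows. This is a minimal illustration, not the libipa implementation: the class name, the method names, and the choice of SHA-256 as the certificate hash are all assumptions made for the example.

```python
import hashlib


class OutgoingCertCache:
    """Per-neighbor record of the hashes of certificates already sent."""

    def __init__(self):
        self._sent = {}  # neighbor id -> set of certificate hashes

    @staticmethod
    def _digest(cert: bytes) -> str:
        # SHA-256 is an assumption; the text only says "the hash".
        return hashlib.sha256(cert).hexdigest()

    def certs_to_send(self, neighbor: str, chain: list[bytes]) -> list[bytes]:
        """Return the certificates in `chain` not yet sent to `neighbor`,
        and record them as sent."""
        sent = self._sent.setdefault(neighbor, set())
        fresh = []
        for cert in chain:
            h = self._digest(cert)
            if h not in sent:
                sent.add(h)
                fresh.append(cert)
        return fresh

    def reset(self, neighbor: str) -> None:
        """Forget everything sent to `neighbor` (peering session lost)."""
        self._sent.pop(neighbor, None)
```

After `reset`, the next announcement to that neighbor carries the full chain again, which matches the behavior on a resumed connection.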


4.4 Validating IP Prefix Certificates 


Figure 3: An example of the certificate caches an AS maintains.
It shows only one outgoing cache of the AS.


When an AS receives a BGP message that advertises
a prefix $p_n$, and includes a list of certificates from a
neighbor, it must validate these certificates to verify $p_n$’s
owner’s public key. It considers a prefix $p_n$’s certificate
$C_{p_n}$ valid if $C_{p_n}$ meets the following conditions:


1. $C_{p_n}$ is not on any Internet registry’s revocation list
or revoked by a newer certificate (§ 4.2).

2. $C_{p_n}$ has a valid parent certificate $C_{p_{n-1}}$ such that 1)
$C_{p_n}$ is signed by its parent certificate $C_{p_{n-1}}$’s private
key; 2) $p_n$ is a subset of its parent certificate’s
prefix $p_{n-1}$. If $p_n$ is the prefix 0/0, $C_{p_n}$ need not
have a parent but must be self-signed by IANA.


Algorithm 1 shows the pseudo-code for the validation
algorithm. Most steps of the algorithm check whether
$C_{p_n}$ satisfies the above conditions. We note two things.
First, if $C_{p_n}$ does not have a valid parent certificate, $C_{p_n}$
becomes unverifiable. Unverifiable certificates may exist
temporarily during a key rollover event (§ 4.5). The
algorithm returns failure but leaves $C_{p_n}$ in the incoming
cache, as it may become valid later after its parent certificate
has arrived. Second, the last section of the code (lines
23-26) adds the newly validated $C_{p_n}$ to the AS’s trusted
cache and checks whether any previously unverified certificate
$C_j$ is now verifiable, which may happen if $C_{p_n}$ is
its parent. If such a certificate $C_j$ exists, the algorithm
recursively validates it and its child certificates.
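
The two validity conditions and the re-validation of waiting certificates can be illustrated with a simplified sketch. It is not the paper's algorithm in full: signatures are replaced by string comparison of key names, the revocation list is a plain set of prefixes, the handling of overlapping sibling certificates is omitted, and the names `Cert`, `ROOT`, `has_valid_parent`, and `validate` are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    prefix: str   # prefix being certified, e.g. "10.0.0.0/8"
    key: str      # stand-in for the certified public key
    signer: str   # stand-in for the key that signed this certificate


ROOT = Cert("0.0.0.0/0", "IANA", "IANA")  # self-signed root for prefix 0/0


def has_valid_parent(cert, trusted):
    """Condition 2: a trusted certificate signed `cert` and covers its prefix."""
    if cert == ROOT:
        return True
    net = ipaddress.ip_network(cert.prefix)
    return any(
        cert.signer == parent.key
        and net.subnet_of(ipaddress.ip_network(parent.prefix))
        for parent in trusted
    )


def validate(cert, trusted, incoming, revoked):
    """Return True if `cert` becomes trusted; unverifiable ones stay in `incoming`."""
    incoming.add(cert)
    if cert.prefix in revoked:          # condition 1
        incoming.discard(cert)
        return False
    if not has_valid_parent(cert, trusted):
        return False                    # keep it; its parent may arrive later
    trusted.add(cert)
    # Re-check certificates that were waiting for a parent.
    for pending in list(incoming - trusted):
        validate(pending, trusted, incoming, revoked)
    return True
```

In this sketch a child certificate that arrives before its parent fails validation but stays in `incoming`, and it is promoted to `trusted` as soon as the parent is validated.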


4.5 Key Management 


Like any cryptography-based system, IPA’s accountabil- 
ity builds on the secrecy of private keys. In addition to 
the standard practice to protect secret keys, IPA takes two 
additional measures: 1) separating an AS’s identity keys 
from the keys the AS uses to sign routing messages, and 
2) periodic key rollovers. 


4.5.1 Separating Identity Keys from Routing Keys 


To secure routing, an AS must store its private key on- 
line to sign routing messages (§ 5.1). Yet it is desirable 
to keep a private key offline to reduce the risk of key 
compromise. To balance security and functionality, IPA 
separates an AS’s identity keys from the keys it uses to 
sign routing messages. We refer to the pair of keys asso- 
ciated with an AS’s self-certifying identifier as its iden- 
tity keys, or its identity key when we refer to either the 
AS’s private or public key.


Algorithm 1 validate($C_{p_n}$, msg): pseudo-code to validate the
certificate $C_{p_n}$ in an incoming BGP update message msg.

Input: $C_{p_n}$, the incoming certificate to be validated; $p_n$,
the prefix of $C_{p_n}$; msg, the incoming BGP message;
cache_tr/cache_in, the current trusted/incoming certificate
cache; rlist[r], the most recent revocation
list of registry r

 1: if is_registry(C_{p_n}.signer) and p_n ∈ rlist[C_{p_n}.signer] then
 2:     cache_in.remove(C_{p_n})
 3:     return false
 4: end if
 5: C_{p_{n-1}} ⇐ cache_tr.lookup_parent(C_{p_n})
 6: if C_{p_{n-1}} == NULL then
 7:     C_{p_{n-1}} ⇐ msg.lookup_parent(C_{p_n})
 8:     if C_{p_{n-1}} == NULL or not validate(C_{p_{n-1}}) then
 9:         return false
10:     end if
11: end if
12: for C_i ∈ cache_tr.get_children_certs(C_{p_{n-1}}) do
13:     if overlap(p_n, p_i) then
14:         if C_{p_n}.inception > C_i.inception then
15:             cache_tr.recursive_remove(C_i)  // remove all certificates in C_i’s subtree
16:             cache_in.remove(C_i)
17:         else
18:             cache_in.remove(C_{p_n})
19:             return false
20:         end if
21:     end if
22: end for
23: cache_tr.insert(C_{p_n})
24: for C_j ∈ cache_in and C_j ∉ cache_tr do
25:     validate(C_j)
26: end for
27: return true


An AS generates a separate pair of public/private keys 
to sign routing messages. We refer to this pair of keys as 
an AS’s routing keys. For each IP prefix it owns, an AS 
will use its identity key to sign a routing certificate that 
binds the IP prefix to its routing key. The AS keeps its 
identity private key offline, and uses its routing private 
key to sign routing messages. An AS will include a pre- 
fix’s routing certificate in its BGP messages. Other ASes 
can validate it using the algorithm described in § 4.4.


4.5.2 Routing Key Rollover 


By separating identity keys from routing keys, an AS 
can periodically expire its routing keys, issue new ones, 
and sign its new routing certificates with its identity key, 
all without changing its identifier, or re-signing its prefix 
sub-delegation certificates.


4.5.3 Identity Key Rollover


An entity should also change its identity keys periodi- 
cally to improve security. To change its identity keys, an 
entity must 1) request new certificates from its parents, 
2) revoke its old certificates, and 3) re-sign each child 
certificate with its new private key. As can be seen, this 
process is more complicated than routing key rollover. 
Thus, an entity should change its identity keys at a lower 
frequency than its routing keys. 

A key challenge we face is how to make a child certifi- 
cate remain valid throughout a parent key rollover event 
so that other ASes can verify the child’s routing mes- 
sages. We address this challenge by “pre-releasing” a 
child’s new prefix certificate, a technique similar to how 
DNSSEC manages key rollovers [39]. With this mecha- 
nism, both a child’s old and new certificates remain valid 
during a key rollover event. 

For clarity, we first describe the identity key rollover
process for an AS, and then for an Internet registry. Figure 4
shows this process. Let D be an AS that wishes to
roll over to a new identity key $K_{new}$. D will first use its
old key $K_{old}$ to generate a transient certificate certifying
$K_{new}$ for each prefix it owns. The transient certificates
are only available during key rollovers, and will expire
afterwards. Meanwhile, D generates a new certificate
for each sub-prefix it delegates to a child using its new
key $K_{new}$. D will also generate new certificates to certify
its routing keys using $K_{new}$. At this point, both $K_{old}$
and $K_{new}$ are valid identity keys of D, because each of
them can be certified by a valid chain of certificates, as
shown in Figure 4(b). D will then publish the child certificates
signed using its new key $K_{new}$ via its certificate
publishing system as described in § 3.2.1.

Each AS will periodically (e.g., once a day) query its 
certificate issuers’ publishing systems to download its 
latest chains of certificates. If the AS obtains IP prefix al- 
locations directly from an Internet registry, it will query 
the corresponding reverse DNS names of its IP prefixes 
starting from the root servers. Otherwise, the AS queries 
its parent ASes’ certificate publishing systems. This on- 
line certificate downloading step does not have a depen- 
dency loop with routing, because each AS’s old certifi- 
cate chain is already in the routing system, and can be 
used to establish valid paths. If an AS C downloads a
new certificate signed by its parent D’s new key, it will 
immediately announce its new certificate in BGP. Other 
ASes will consider C’s new prefix certificate valid, be- 
cause it is certified by a valid chain of trust, including 
the link provided by the parent D’s self-signed transient 
certificate, as shown in Figure 4(b). 

Finally, the rekeying AS D requests each of its parents
P that has delegated an IP prefix to its old key $K_{old}$ to
issue a new certificate to its new key $K_{new}$, after waiting
for a long enough period d. The waiting period d should
be long enough to ensure that each child AS of D has
successfully downloaded and announced its new certificates
in BGP. D can then announce its new certificate for
its new key $K_{new}$ in BGP to revoke its old certificate.
The child AS C’s certificate will remain valid, as shown
in Figure 4(c). D will also re-send its BGP routes
to its neighbors using its new identifier.

[Figure 4 diagram: panels (a), (b), (c); nodes Parent (P) with key $K_P$, Rekeying Entity (D), and Child (C) with key $K_C$.]

Figure 4: This figure shows IPA’s key rollover process.
Each node represents a key; an arrow points from a parent’s
signing key to a child’s signed key. Figure (a) shows
the chain of trust before the key rollover; (b) shows the
chains of trust during the key rollover, where the rekeying
entity D signs a transient certificate to certify its new
key $K_{new}$ using its old key $K_{old}$; (c) shows that when the key
rollover process finishes, the old key $K_{old}$ becomes invalid.

An Internet registry’s key rollover procedure is similar, 
except that the registry need not announce a new certifi- 
cate in BGP, as its children will obtain it via DNSSEC. 


4.5.4 Recovering From Key Compromise 


With the preventive measures we describe above, we ex- 
pect key compromise to be a rare event in IPA. For com- 
pleteness, we briefly describe how to recover from it and 
leave the details to [42]. 

Recovering from key compromise resembles a key 
rollover event, except that an entity may resort to con- 
tacting its parents and children offline to obtain its new 
certificates and distribute its children’s new certificates. 
This is because when an attacker compromises an en- 
tity’s identity keys, it may also hijack the entity’s IP pre- 
fixes, making it unreachable online. 


5 Use of IPA 


In this section, we describe how IPA enables various 
security modules that collectively achieve accountable 
routing and forwarding, and DoS attack mitigation. Each 
of the modules we describe here is also gradually adopt- 
able [26, 38, 43, 45]. 


5.1 Accountable Routing 


IPA enables secure routing protocols such as S-BGP [38],
because it provides ASes with the necessary
certificates to achieve origin authentication and AS path
authentication.


Origin Authentication: An AS O that owns a prefix p 
can now sign its BGP messages when it announces the 
prefix, because other ASes can use the chain of certifi- 
cates piggybacked in the BGP messages to verify the se- 
cure binding between the prefix p and O’s public key 
(§ 3.3), preventing other ASes from originating p. 


AS Path Authentication: Each transit AS can sign a 
BGP update using its private key when it prepends its 
self-certifying AS identifier to the update and propagates 
the update to a neighbor. A malicious AS cannot forge 
another AS’s identifier, nor can it truncate the AS path, 
because it cannot generate a valid signature of another 
AS. A transit AS can piggyback its public key in a BGP 
message similar to how IPA distributes prefix certificates 
(§ 3.3). We can also apply the same caching technique
described in § 4.3 to reduce the message overhead. 

Self-certifying ASNs prevent path forgery, but raise 
a different security concern: an AS may mint arbitrary 
identifiers, which complicates BGP policy configura- 
tions. The IPA design addresses this concern by binding 
a self-certifying ASN to an IP prefix. If an AS path con- 
tains an ASN that is not a hash of a public key found in 
a valid IP prefix certificate, other ASes can consider the 
path not trustworthy, and configure their BGP policies to 
avoid this path. Moreover, an AS can use IP prefixes to 
configure its BGP policies, because other ASes cannot 
arbitrarily change their IP prefixes. 
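
The check that ties a self-certifying ASN to a certified key can be sketched as below. The hash function (SHA-256, hex-encoded) and the function names are assumptions for illustration; the text only states that an ASN is a hash of a public key.

```python
import hashlib


def self_certifying_asn(public_key: bytes) -> str:
    # Assumption: the identifier is a SHA-256 digest in hex; the text only
    # says that an ASN is "a hash of a public key".
    return hashlib.sha256(public_key).hexdigest()


def untrusted_asns(as_path, certified_keys):
    """Return the ASNs in `as_path` that match no key from a valid prefix certificate."""
    known = {self_certifying_asn(k) for k in certified_keys}
    return [asn for asn in as_path if asn not in known]
```

A BGP policy could then treat any path for which `untrusted_asns` is non-empty as not trustworthy.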


5.2 Accountable Forwarding 


The ability to securely sign BGP messages enables Pass- 
port [43], a system that can achieve both packet source 
authentication and forwarding path inconsistency detec- 
tion. Passport uses a distributed Diffie-Hellman key ex- 
change piggybacked in BGP to establish a shared secret 
between every pair of ASes. With IPA, an AS O can sign 
the BGP messages that originate both its prefixes and its 
Diffie-Hellman public value. Other ASes can securely 
bind the secrets they share with AS O with O’s prefixes 
to enable AS-level packet source authentication and path 
inconsistency detection. 


Packet Source Authentication: To authenticate a 
packet’s source address, a source AS stamps a sequence 
of message authentication codes (MACs) into a packet 
header using the secret keys it shares with each AS en 
route to the packet’s destination. ASes along the path 
can re-compute the MACs to validate the packet’s origin 
AS, as packets with spoofed source addresses will not 
have valid MACs. 
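
The stamping and re-computation of per-AS MACs can be sketched as follows. This is an illustration only: Passport defines its own MAC construction and header layout, whereas the sketch assumes HMAC-SHA256 truncated to 8 bytes and a dictionary keyed by AS name; `mac`, `stamp`, and `verify` are hypothetical names.

```python
import hashlib
import hmac


def mac(key: bytes, packet: bytes) -> bytes:
    # HMAC-SHA256 truncated to 8 bytes is an illustrative choice only.
    return hmac.new(key, packet, hashlib.sha256).digest()[:8]


def stamp(packet: bytes, path_keys: dict) -> dict:
    """Source AS: one MAC per on-path AS, keyed with the secret shared with it."""
    return {asn: mac(key, packet) for asn, key in path_keys.items()}


def verify(asn: str, shared_key: bytes, packet: bytes, macs: dict) -> bool:
    """On-path AS: recompute its own MAC and compare in constant time."""
    expected = mac(shared_key, packet)
    return hmac.compare_digest(macs.get(asn, b""), expected)
```

An AS that does not hold the shared secret cannot produce a MAC that `verify` accepts, which is why spoofed packets fail the check.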


Forwarding Path Inconsistency Detection: A mali- 
cious AS may attempt to advertise one legitimate AS 
path but forward packets along a different one that conflicts
with a source AS’s routing policies. The MACs that
a source AS stamps into a packet header can help detect 
this misbehavior. This is because if a packet’s forward- 
ing path differs from the AS path its source AS selects to 
use, an AS on the path will detect an invalid MAC, but 
the destination AS will detect a valid one. A destination 
AS can use this discrepancy to notify the source AS of 
the forwarding path inconsistency. 


5.3 DoS Attack Mitigation


Finally, because IPA enables source authentication, it 
also enables DoS defense systems that use authen- 
tic source addresses to suppress attack traffic near its 
sources, e.g., the filter-based system StopIt [44], or NetFence
[45], a system based on unspoofable congestion
policing feedback. 

As an example, we describe briefly how NetFence can 
use IPA to suppress DoS flooding traffic near its sources. 
NetFence introduces a secure congestion policing frame- 
work in the network. A NetFence packet carries un- 
spoofable congestion policing feedback in a shim layer. 
An on-path AS updates this feedback to notify an access 
router of its local congestion conditions, and an access 
router uses this feedback to regulate a sender’s sending 
rate. The on-path AS and the source AS use the secret 
they share via Passport to protect this feedback from being
tampered with by malicious routers or end systems. When
malicious sources and receivers collude to flood a link in 
the network, NetFence provides a legitimate sender its 
fair share of bandwidth. When a receiver is an inno- 
cent DoS victim, NetFence enables the receiver to use 
the unspoofable congestion feedback as network capa- 
bilities [59] to suppress the bulk of unwanted traffic. 

We introduce AS-level hierarchical accountability to 
NetFence to accommodate IPA’s self-certifying ASNs. 
The original NetFence design uses AS-level queues at a 
router to hold each source AS accountable for its traffic. 
With IPA, we use hierarchical queuing [24] that follows 
the IP allocation hierarchy to hold each AS accountable. 
That is, the traffic from all IP prefixes allocated to an 
AS’s public key will share one queue; a router may sub- 
divide the queue into multiple lower-level queues, if the 
AS delegates sub-prefixes to its customers, and so on. A 
router sets a queue’s weight according to the size of the 
IP prefixes associated with the queue, not by the number 
of ASes sharing the IP prefixes. This mechanism pre- 
vents an AS from gaining unfair network resources by 
dividing its IP prefixes into many smaller ones and dele- 
gating them to minted identifiers. 
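
The effect of weighting queues by prefix size can be illustrated with a short sketch. It assumes, for the example only, that a queue's weight is proportional to the number of addresses its prefixes cover and that the prefixes of one queue do not overlap; `queue_weight` is a hypothetical helper.

```python
import ipaddress


def queue_weight(prefixes):
    """Weight of a queue = number of addresses covered by its prefixes.

    Assumes the prefixes do not overlap; overlapping ones would be
    counted twice.
    """
    return sum(ipaddress.ip_network(p).num_addresses for p in prefixes)
```

Under this assumption, an AS that splits a /16 into two /17s and delegates them to minted identifiers ends up with two queues whose weights add up to the weight of the original queue, so the split gains nothing.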


6 Implementation 


We have implemented a prototype of IPA’s in-band certificate
distribution mechanism (§ 3.3) using XORP [33].
The implementation includes a standalone C++ library
libipa that other BGP implementations can use. The library
implements certificate distribution and validation,
and supports downloading revocation lists and
new certificates from DNSSEC.

Our implementation addresses several practical issues 
that arise when an IPA router peers with a legacy router. 
First, we disable the caching optimization (§ 4.3) on
an IPA router’s interface facing a legacy router, because 
a legacy router does not cache any certificate or public 
key. Furthermore, legacy BGP has a 4KB limit on the 
size of an update message. To bypass this limitation, 
an IPA router breaks a message longer than 4KB into 
smaller ones, each of which carries a subset of the certifi- 
cates and public keys of the original message. The router 
sends them in sequence to its legacy neighbor. The IPA 
router waits for a period of time longer than the BGP’s 
MRAI timer (e.g., a few minutes) between sending out 
two consecutive messages to prevent the first message 
from being overwritten by the second one. 
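
The splitting step can be sketched as a greedy packing of certificates and keys into messages that stay within the 4KB limit. The function name, the `base_size` parameter, and the decision to ignore per-attribute framing bytes are assumptions of this example, not details of our implementation.

```python
MAX_UPDATE = 4096  # legacy BGP update size limit, in bytes


def split_update(base_size: int, items: list[bytes], limit: int = MAX_UPDATE):
    """Greedily pack certificates/keys into updates of at most `limit` bytes.

    `base_size` is the size of the update without any certificate; each
    returned group is the subset carried by one message.
    """
    groups, current, used = [], [], base_size
    for item in items:
        if base_size + len(item) > limit:
            raise ValueError("a single item does not fit in one update")
        if used + len(item) > limit and current:
            groups.append(current)
            current, used = [], base_size
        current.append(item)
        used += len(item)
    if current:
        groups.append(current)
    return groups
```

For example, eight 650-byte certificates on top of a 100-byte update are split into one message with six certificates and one with two.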

We have also extended previous implementations of S- 
BGP, Passport, and NetFence and incorporated them into 
the IPA prototype. We defer a systematic evaluation on 
the integrated architecture to future work. 


7 Evaluation 


In this section, we evaluate IPA along four dimensions. 
First, we use small-scale testbed experiments to validate 
the design and implementation. Second, we use trace- 
driven benchmarks to measure the design’s performance 
and overhead. Third, we use live Internet experiments 
and analysis to evaluate the design’s adoptability. Fi- 
nally, we analyze IPA’s security properties. 


7.1 Testbed Experiments 


We use DETERlab [28] experiments to validate the
design and implementation of IPA. These experiments 
include 1) bootstrapping experiments, 2) key rollover 
experiments, and 3) prefix hijacking experiments. We 
sample a small test topology from the AS-level Internet 
topology inferred from BGP table dumps. This topology 
includes six university ASes and all ASes on the shortest 
AS paths between the six ASes. It contains 17 ASes and 
54 uni-directional links. We would like to run larger-scale
experiments, but are limited by the number of testbed 
machines we can obtain. For simplicity, we assume each 
AS owns one prefix, and choose the prefix to be the 
largest one the AS owns in reality. Finally, we assume all 
ASes use DNSSEC to issue and publish their certificates, 
and use the signing tool included in BIND9 [3] to gen- 
erate the certificates. The topology includes four levels 
of IP prefix allocation: IANA, RIRs, top-level ASes, and 
customer ASes. We randomly pick three ASes to host the 
root and two RIRs’ DNSSEC servers. We assume each 
AS’s DNSSEC server is inside its network. Each node
in a testbed experiment corresponds to an AS. Each AS
is configured with an initial IP prefix certificate chain.

Figure 5: This figure shows the average DNS traffic load of
each Internet registry to serve the revocation list and the IP
prefix certificates.

We summarize the testbed experiment results as fol- 
lows. In a bootstrapping experiment, each node can val- 
idate all certificates and store them in its trusted cache, 
suggesting that the system can successfully bootstrap, 
consistent with the liveness property of IPA’s in-band 
certificate distribution protocol (§ 3.3). In a key rollover 
experiment, the rekeying ASes can successfully propa- 
gate their new certificates, and each prefix always has at 
least one valid chain of certificates during the rollover 
period. Finally, we run our S-BGP module using the cer- 
tificates distributed by IPA. We launch a prefix hijack- 
ing attack from an AS. All other ASes reject the update 
message because there does not exist a certificate chain 
certifying the AS’s ownership of the hijacked prefix. 


7.2 Performance 


IPA adds overhead to both DNS and BGP. We use trace- 
driven benchmarks to evaluate this overhead. The results 
show that IPA’s overhead on DNS and BGP is accept- 
able. We use a PC with Xeon 3GHz CPU and 2GB mem- 
ory to run all of our experiments unless otherwise noted. 


7.2.1 DNS Overhead 


IPA uses a signed TXT record in DNS to publish an Inter- 
net registry’s revocation list (§ 4.2). An AS periodically 
downloads the revocation list from each registry. Each 
entry in a revocation list can be encoded in ≤30 bytes
(≤18 bytes for an IPv4 prefix in the dotted-decimal format,
one byte for space, 10 bytes for the revocation time,
and one byte for the line break). A publisher can compress
a list (e.g., using gzip) to reduce overhead. An AS
also needs to download the list’s signature (~300 bytes) 
and a few other DNSSEC records. 
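
The entry layout, together with the compression and text encoding used in this section, can be sketched as follows. The 10-byte revocation time is assumed here to be a zero-padded decimal Unix timestamp, and the function names are hypothetical; the sketch does not model DNS TXT record framing or the DNSSEC signature.

```python
import base64
import gzip


def encode_entry(prefix: str, revoked_at: int) -> str:
    """One line: dotted-decimal prefix, a space, a 10-digit time, a line break."""
    return f"{prefix} {revoked_at:010d}\n"


def encode_list(entries) -> str:
    """gzip the list, then base64 it so it can be stored as a text record."""
    raw = "".join(encode_entry(p, t) for p, t in entries).encode("ascii")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_list(text: str):
    """Inverse of encode_list: recover the (prefix, time) pairs."""
    raw = gzip.decompress(base64.b64decode(text)).decode("ascii")
    return [(p, int(t)) for p, t in (line.split(" ") for line in raw.splitlines())]
```

The longest IPv4 prefix string, `255.255.255.255/32`, has 18 characters, so the longest entry under these assumptions is 18 + 1 + 10 + 1 = 30 bytes.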

We assume that at any time, a registry at most revokes 
1% of the total prefixes that it owns and does not re- 
allocate them to others. We use gzip to compress each 
revocation list, and use base64 to encode a compressed 
list so that it can be stored as a text record. The BGP report
of February 2011 [15] shows that there are a total of
37K ASes on the Internet. We assume that an AS downloads
a revocation list once per day. This downloading
frequency is acceptable, because it at most allows a prefix’s
previous owner to use the prefix for one extra day.

BGP Table Dump
  Date collected         | 08/01/2010
  Number of ASes         | 35728
  Number of IP prefixes  | 337K
BGP Update Trace
  Vantage point          | route-view2.oregon-ix.net
  Number of peers        | 37
  Date collected         | 08/01/2010~08/31/2010
  Number of updates      | 118 million
  Average arrival rate   | 44.1 updates/s

Table 1: This table summarizes the BGP data we use in
evaluating IPA’s routing overhead.

Figure 5 shows the average traffic load for serving the 
list at each Internet registry’s DNS servers. As can be 
seen, even for the busiest registry ARIN, the estimated 
communication overhead is less than 10Kbps. This overhead
is negligible compared to the regular load of a top-level
DNS server, e.g., the “M” root DNS server’s regular
load is over 32Mbps [10].

In the IPA design, an AS may also periodically down- 
load its certificate chains from the Internet registries to 
deal with key rollovers (§ 4.5). To evaluate this overhead, 
we assume that all ASes publish the IP prefix certifi- 
cates they delegate to their children using DNSSEC. This 
places an upper bound on the top-level DNS servers’ 
load. Each certificate includes three DNSSEC records 
and is about 650 bytes long (§ 4.1). We assume that each
AS downloads its certificates once every day for each 
prefix it owns. Figure 5 shows the average traffic load 
from all registries for serving the certificate downloads. 
As can be seen, the IANA’s DNS servers have the high- 
est certificate serving overhead, but it is still much lower 
than a root DNS server’s regular load, which suggests 
that IPA is unlikely to stress DNS. 


7.2.2 Routing Overhead 


We use trace-driven experiments to evaluate the over- 
head of IPA’s in-band certificate distribution mechanism. 
We obtain a real BGP update trace from a Route Views 
server [53]. Table 1 summarizes the BGP data we use. 
We then add IPA specific fields and updates to the trace 
to obtain a synthetic IPA BGP trace. We use the synthetic 
IPA trace to estimate the message overhead of distribut- 
ing IP prefix certificates in-band. We also feed the IPA 
trace to a PC router running our IPA implementation, and 
measure the router’s processing and memory overhead. 
We generate the IPA BGP trace in three steps: 1) in- 
ferring IP prefix delegation hierarchy; 2) adding certifi- 
cates for newly allocated and re-assigned prefixes; and 
3) adding updates triggered by key rollover events. We
describe each step in more detail.

Figure 6: The distribution of the depth of each prefix in the
inferred IP prefix delegation hierarchy.



First, we infer a prefix’s delegation hierarchy to de- 
cide what certificates to add to a BGP update message 
announcing that prefix. We use a BGP table dump to in- 
fer this information. If an AS originates an IP prefix in 
the BGP table, we assume that it is the prefix’s owner. 
If a prefix p′ includes another prefix p, and both prefixes
appear in the BGP table, we infer that the owner AS of p′
delegates the prefix p to the owner of p. We also combine the IP
prefix allocation records obtained from RIRs and IANA’s 
websites to build the entire IP prefix delegation hierar- 
chy. Figure 6 shows the distribution of the depth of the 
inferred hierarchy. More than 80% of the prefixes have a delegation
depth of 3 or 4, suggesting that most ASes obtain
IP prefixes directly from the RIRs or from provider ASes 
that directly obtain IP prefixes from the RIRs. 
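
The containment rule used for this inference can be sketched as below. The sketch assumes that the immediate delegator of a prefix is its most specific covering prefix in the table, which is one reasonable reading of the rule rather than a statement of our exact procedure; `infer_delegations` is a hypothetical name, and the quadratic search is for clarity only.

```python
import ipaddress


def infer_delegations(origins: dict) -> dict:
    """Map each prefix to its closest covering prefix in the table, or None.

    `origins` maps a prefix string to the AS that originates it.
    """
    nets = {ipaddress.ip_network(p): p for p in origins}
    parent_of = {}
    for net, name in nets.items():
        covering = [n for n in nets if n != net and net.subnet_of(n)]
        # The most specific covering prefix is taken as the immediate delegator.
        best = max(covering, key=lambda n: n.prefixlen, default=None)
        parent_of[name] = nets[best] if best is not None else None
    return parent_of
```

The delegating AS is then the origin of the returned parent prefix, e.g. `origins[parent_of["10.1.2.0/24"]]`.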


Second, we add prefix certificates to BGP updates that 
announce newly allocated or re-assigned IP prefixes. Ac- 
cording to the IPA design (§ 3.3), an AS only sends an 
IP prefix certificate to a neighbor if it has not sent the 
certificate to the neighbor before. Thus, after the rout- 
ing system has bootstrapped, only two types of updates 
carry IP prefix certificates: 1) an update that announces 
a newly allocated or re-assigned prefix, and 2) an up- 
date that carries new certificates generated during key 
rollovers (§ 4.5) for a previously announced prefix. We 
treat any IP prefix that has not appeared in the trace be- 
fore as a newly allocated prefix, and any prefix whose 
origin AS has changed as a re-assigned prefix. To esti- 
mate the upper bound on the message overhead, we add 
the full certificate chain to each BGP update announcing 
a newly allocated or re-assigned prefix. 


Finally, we add the update messages triggered by key
rollover events to the IPA trace. Let a key rollover interval
be $T_r$ seconds. We let each AS randomly choose a
key rollover time $t$ during the $T_r$ interval. We then add
BGP updates that include the rekeying AS’s new certificates
for all its prefixes and its child ASes’ prefixes at
time $t$ in our trace. We add updates for both routing and
identity key rollovers (§ 4.5). We assume that as an upper
bound, each AS changes its routing keys once a week,
and its identity keys once a month.

Figure 7: The cumulative distribution of an IPA BGP update
message size.

Figure 8: The update traffic rate a RouteViews server sees
averaged over 1-minute intervals during one week.
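
The random choice of rollover times can be sketched as follows. The sketch assumes that each AS draws one uniform offset within the rollover interval and then repeats the rollover once per interval for the length of the trace; the repetition, the fixed seed, and the function name are assumptions of this example.

```python
import random


def rollover_times(as_ids, interval_s: float, horizon_s: float, seed: int = 0):
    """For each AS, pick a random offset t in [0, interval_s) and repeat it
    every interval_s seconds until horizon_s. Returns sorted (time, AS) pairs."""
    rng = random.Random(seed)
    events = []
    for asn in as_ids:
        t = rng.random() * interval_s
        while t < horizon_s:
            events.append((t, asn))
            t += interval_s
    return sorted(events)
```

With a one-week interval and a 31-day trace, every AS contributes four or five routing key rollovers under these assumptions.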


Message Overhead: Figure 7 shows the cumulative dis- 
tribution of an IPA message size in one day’s trace (Au- 
gust 1, 2010). The distributions in other days are similar 
and hence omitted. For comparison, we also show the 
distribution of an original BGP message size. As can be 
seen, over 80% of the IPA messages are smaller than 500 
bytes. Given that each IP prefix certificate is around 650 
bytes (§ 4.1), we can infer that over 80% of the messages
do not carry any certificate, indicating that the caching 
mechanism described in § 4.3 is effective in reducing 
message overhead. 

Figure 8 shows the IPA BGP update rate averaged over 
1-minute bins in one week (August 1-7). The results 
during other weeks are similar and are omitted for clar- 
ity. For comparison, we also show the vanilla BGP up- 
date rate. The RouteViews server we use peers with 37 
large ISPs. So we expect that the update process it sees 
is representative of what a BGP router sees in a large 
ISP [8]. The rate shown in Figure 8 is the aggregate ar- 
rival rate over all peers of the server. As can be seen, IPA 
increases the update traffic rate compared to the vanilla 
BGP. The 1-minute average aggregate update rate is usu- 
ally less than 200KB/s. Since there are 37 peers, each 
peer on average receives less than 6KB/s update traffic. 
We think this overhead is acceptable compared to today’s 
core routers’ link capacities (10Gbps or 40Gbps).
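
As a rough check of the per-peer figure, taking the 200KB/s aggregate bound and the 37 peers at face value and assuming 1KB = 1000 bytes:

```latex
\[
\frac{200\ \text{KB/s}}{37\ \text{peers}} \approx 5.4\ \text{KB/s per peer}
\approx 43\ \text{Kbps},
\qquad
\frac{43\ \text{Kbps}}{10\ \text{Gbps}} \approx 4.3 \times 10^{-6}.
\]
```

Under these assumptions the per-peer update traffic is a few millionths of a 10Gbps link.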


Processing Overhead: We evaluate an IPA router’s processing
overhead by measuring 1) the fraction of CPU
time it takes to process IPA’s BGP messages, and 2) each
message’s processing latency. We aggregate the BGP update
messages into 1-minute bins to measure the CPU
utilization. We feed the messages that arrive in each bin to
our IPA router implementation, measure the aggregate
processing time, and compare it with the bin size.

Figure 9: The CPU time taken to process the messages received
per 1-minute bin.

Figure 10: The arrival and departure time of each message
received during a day. The message number is in units
of millions (M).

Figure 9 shows the result during a one-day period (Au- 
gust 1, 2010) with 1-minute bins. The results for other 
days are similar and we omit them for clarity. For com- 
parison, we also show the CPU time a XORP BGP router 
spends to process the original BGP trace. For each time 
bin, IPA takes more time to process the messages than 
the vanilla BGP, because it needs to validate new certifi- 
cates piggybacked in the incoming messages. However, 
the CPU time that the router spends to process the messages
in each 1-minute bin is usually less than 30 seconds,
indicating that the router’s CPU utilization is less than 50%
and CPU is not a bottleneck. We may further improve 
our implementation’s efficiency by applying instruction- 
level optimization to the RSA algorithm [40]. 

We further evaluate IPA’s processing latency and ex- 
amine whether it can keep up with the update arrival rate. 
We feed each update to the IPA router implementation 
according to the time it arrives. Figure 10 shows the ar- 
rival and departure time of each message. As can be seen, 
the arrival and departure lines almost overlap with each 
other, indicating that our implementation running on a 
commodity PC can keep up with the update arrival rate 
of the Route Views server. 


Memory Overhead: To evaluate IPA’s memory overhead,
we feed the IPA BGP trace to our IPA implementation,
and measure the memory needed to store all certificate
caches. With our implementation, the trusted certificate
cache consumes around 356MB memory using the
BGP table data shown in Table 1. Our implementation
stores only one physical copy for each certificate. The
same certificates in different caches are pointers to the
physical copy. The incoming cache uses ~1.5MB memory
to store the pointers. An outgoing cache uses at most
7MB, because it only needs to store a hash value for each
certificate. This memory overhead is moderate because
a router need not use these certificates at packet forwarding
time and can store them in low-cost DRAM.

Figure 11: The number of updates received by each RouteViews
vantage point.


7.3 Adoptability 


In this section, we use real Internet experiments and analysis
to evaluate IPA’s adoptability. An adoptable design
must satisfy two conditions: it must be gradually deployable,
and it must provide incentives to early adopters.


7.3.1 Gradual Deployment 


IPA uses the top-level DNSSEC infrastructure and BGP 
to certify and distribute IP prefix certificates. We evalu- 
ate whether early adopters can gradually deploy IPA in 
each system. 


DNSSEC: First, we evaluate whether a legacy DNSSEC 
implementation can serve the DNSSEC records and re- 
vocation lists needed by IPA. We deploy a BIND9 DNS 
server which supports DNSSEC natively and has the 
largest installation base [16]. We use the DNSSEC sign- 
ing tool bundled with the server software to generate the 
DNSSEC zone records for the IP prefixes allocated by 
IANA and all five regional Internet registries, and con- 
figure the server to serve the records and the revocation 
lists. We then use a legacy DNS client dig to fetch them. 
The dig client successfully retrieves all the records, in- 
dicating that the Internet registries can directly serve the 
DNSSEC records required by IPA without modifying 
DNS servers or breaking DNS clients. 


BGP: We use BGP’s transitive and optional path at- 
tributes to carry IPA-related fields. This design allows 
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Figure 12: The AS path length distribution of the received 
updates that carry the optional and transitive test attribute 
we inject. The path is from a RouteViews vantage point to 
the injection location. 


upgraded ASes to run the IPA protocols even if they are 
connected by legacy routers. This is because, according 
to the BGP standard [51], legacy routers should forward 
any transitive and optional attribute. 

To test IPA’s compatibility with legacy BGP routers, 
we use a modified Quagga [11] BGP daemon to inject a 
BGP update with a transitive and optional attribute. We 
then monitor the propagation of this update from multi- 
ple RouteViews’ vantage points. On August 27, 2010, 
we injected one such update into BGP using the BGP beacon 
platform maintained by RIPE RIS [13]. The update 
includes a previously unused prefix and a 3KB path at- 
tribute with an unknown type code 99. Figure 11 shows 
the number of updates observed by each RouteViews 
vantage point and among them how many still carry the 
attribute. For the updates still carrying the attribute, Fig- 
ure 12 shows the AS path length distribution from their 
vantage points to the injection point. As can be seen, 
each vantage point observes at least one update carry- 
ing the attribute, and most of the updates carrying the at- 
tribute have successfully traversed multiple legacy ASes. 
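
To make the experiment concrete, the following sketch encodes a path attribute in the wire format of the BGP standard [51] and shows the handling that standard prescribes for a router that does not recognize the type code. It is an illustration only, not the modified Quagga code: the function names are ours, and the 3072-byte payload of zeros is a stand-in for the 3KB test value.

```python
import struct
from typing import Optional

# Attribute flag bits defined by the BGP standard (high-order bits of the flags octet).
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_PARTIAL = 0x20
FLAG_EXTENDED_LENGTH = 0x10


def encode_path_attribute(type_code: int, value: bytes, flags: int) -> bytes:
    """Encode flags, type code, length, and value of one path attribute."""
    if len(value) > 0xFFFF:
        raise ValueError("attribute value too long")
    if len(value) > 255:
        # Values longer than 255 bytes need the two-octet length field.
        flags |= FLAG_EXTENDED_LENGTH
        header = struct.pack("!BBH", flags, type_code, len(value))
    else:
        flags &= ~FLAG_EXTENDED_LENGTH & 0xFF
        header = struct.pack("!BBB", flags, type_code, len(value))
    return header + value


def legacy_forward(attribute: bytes) -> Optional[bytes]:
    """Handling of an attribute whose type code the router does not recognize."""
    flags = attribute[0]
    if not flags & FLAG_OPTIONAL:
        raise ValueError("unrecognized well-known attribute")
    if not flags & FLAG_TRANSITIVE:
        return None  # optional non-transitive: ignored and not passed on
    # Optional transitive: set the Partial bit and keep the attribute.
    return bytes([flags | FLAG_PARTIAL]) + attribute[1:]


test_attr = encode_path_attribute(
    99, bytes(3072), FLAG_OPTIONAL | FLAG_TRANSITIVE
)
forwarded = legacy_forward(test_attr)
```

A compliant legacy router therefore changes only the Partial bit, which is why the attribute can survive a path through several non-upgraded ASes.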

The RouteViews vantage points also receive many up- 
dates without the attribute. We suspect that this is caused 
by a Cisco software bug triggered by the injected up- 
date [17]. The bug causes certain Cisco router models to 
corrupt the path attribute. Consequently, a downstream 
router may reset the connection or remove the corrupted 
attribute. Given the prevalence of Cisco routers, we think 
that the result is encouraging. We expect that the operators 
of the affected routers will soon patch this bug, and that we 
would observe many more updates carrying the test attribute 
if we repeated this experiment. 


7.3.2 Incentives for Early Adopters 


We now discuss how the IPA design provides incentives 
for early adopters. Our analysis is based on the adopt- 
ability model presented in [26, 43]. The model assumes 
that each potential adopter is rational, and will have in- 
centives to adopt a security mechanism if the security 
benefits outweigh the adoption costs. Because it is difficult 
to quantify costs, we use the model to qualitatively 
argue that IPA provides stronger incentives for adoption 
than previous work [34, 38, 46, 56, 57]. Thus, from a 
cost-effectiveness perspective, it is more likely to be adopted 
than previous work. We do not claim that IPA will be 
adopted, as many other factors (e.g., politics) may affect 
the adoption process. 

IPA’s deployment involves four key parties: Internet 
registries, ASes, router vendors, and OS vendors. For 
simplicity, we focus on discussing the deployment incen- 
tives for the Internet registries and ASes, as past experi- 
ences of deploying DNSSEC [50] and IPv6 [36] suggest 
that they are often the deployment bottlenecks. 

For the Internet registries, IPA achieves security benefits 
similar to those of previous work that requires a PKI [34, 
38, 46, 56, 57], but has significantly lower deployment 
and management costs. This is because IPA uses the top- 
level DNSSEC infrastructure to bind an IP prefix to its 
owner’s key. A DNSSEC-enabled registry need not de- 
ploy or manage any additional infrastructure to deploy 
IPA. Therefore, we believe that the Internet registries will 
have stronger incentives to deploy IPA than to deploy the 
dedicated PKI required by previous work. 

The IPA design also provides stronger deployment in- 
centives for ASes than previous work, because ASes 
need not wait for the Internet registries to deploy a PKI 
and need not deploy additional certificate distribution in- 
frastructures. Once the Internet registries have deployed 
IPA using DNSSEC, the top-level ASes that obtain IP 
prefixes directly from those registries can obtain imme- 
diate security benefits by distributing their IP prefix cer- 
tificates in BGP and signing their BGP messages. These 
ASes will form a “club” to prevent prefix hijacking at- 
tacks within the club [26]. Using the IP prefix delegation 
hierarchy inferred in § 7.2.2, we find that such top-level 
ASes account for more than 78% of the total ASes. Once 
the top-level ASes have deployed IPA, their customers 
can obtain security benefits by adopting IPA, and so on. 
As the size of the protected club increases, the immedi- 
ate security benefits that an adopter obtains also increase, 
which encourages more adopters, and can lead to a net- 
work effect of adoption [26]. 
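
The staged adoption argued above can be pictured with a toy sketch. It is not the adoptability model of [26]: it only assumes that each AS has a single delegating parent and that an AS can benefit from adopting once its parent has adopted. The AS names in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def adoption_waves(delegated_from: Dict[str, Optional[str]]) -> List[List[str]]:
    """Group ASes by the earliest round in which adoption pays off.

    delegated_from maps an AS to the AS that delegated its prefix,
    or to None if the prefix came directly from an Internet registry.
    """
    children = defaultdict(list)
    wave = []
    for asn, parent in delegated_from.items():
        if parent is None:
            wave.append(asn)  # top-level AS: can join the club first
        else:
            children[parent].append(asn)

    waves = []
    while wave:
        waves.append(sorted(wave))
        # Customers of the current wave can adopt in the next round.
        wave = [child for asn in wave for child in children[asn]]
    return waves


hierarchy = {"AS1": None, "AS2": None, "AS3": "AS1", "AS4": "AS3", "AS5": None}
waves = adoption_waves(hierarchy)
top_level_share = len(waves[0]) / len(hierarchy)
```

In this made-up hierarchy the first wave contains three of five ASes; the 78% figure reported above plays the same role for the measured delegation hierarchy.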


7.4 Security Analysis 


IPA bootstraps accountability with cryptography-based 
secure identifiers. Its security builds on the secrecy of 
private keys. The design stores private identity keys of- 
fline and uses periodic key rollovers to protect private 
keys. As long as the private keys remain secret, other security 
modules can use IPA to achieve accountable routing 
and forwarding, and DoS mitigation (§ 5). 

The IPA design uses self-certifying AS identifiers. An 
AS may mint non-existent child AS identifiers by del- 
egating sub-prefixes to those minted child ASes. How- 


ever, because the minted identifiers are associated with 
sub-prefixes inside the AS’s address space, the network 
can hold malicious ASes accountable by their address 
spaces to prevent them from evading traffic policing or 
gaining unfair shares of network resources (§ 5.3). An 
AS may inflate the AS path length in a BGP message 
by inserting the minted child AS identifiers, but it can 
already achieve this goal by padding its own identifier in the 
message, which is a common BGP practice. 


8 Related Work 


The work most closely related in scope is the AIP architecture 
[20], which uses self-certifying identifiers as host 
addresses and domain identifiers. IPA retains the hier- 
archical IP addressing structure, but uses self-certifying 
AS identifiers. Unlike AIP, IPA’s deployment does not 
require host re-numbering or trusted host hardware, but 
it requires the global root of trust of today’s Internet 
(IANA) to continue to exist and function. 

Public Key Infrastructures (PKIs) offer a hierarchical 
way to securely bind an identifier to a public key. Much 
existing work on secure routing, such as S-BGP [38], 
soBGP [57], psBGP [56], SPV [34], and Origin Authentication 
[46], requires the Internet registries to establish 
dedicated global PKIs to certify IP prefix ownerships or 
AS number ownerships. IPA obviates such requirements 
by using the existing top-level DNSSEC infrastructure 
to certify IP prefix allocations and using self-certifying 
identifiers as AS numbers. soBGP proposes to use a new 
type of BGP message to distribute various certificates in 
the routing system, while IPA uses a standard BGP ex- 
tension to distribute IP prefix certificates. 

The DNS CERT resource record (RR) [37] provides a 
generic way to store multiple types of certificates such 
as X.509, SPKI, and PGP with a DNS name. These cer- 
tificates do not necessarily certify the DNS zone delega- 
tions, and hence do not certify IP prefix delegations. In 
contrast, IPA uses the Designated Signer and DNSKEY 
RRs rather than the CERT RR to map a reverse DNS 
zone delegation to an IP prefix delegation. 

Simon et al. define network-layer accountability as 
traffic source identification and malicious traffic deter- 
rence [54]. Their design assumes pairwise and transitive 
trust between ASes, and uses ingress filtering and an evil- 
bit in a packet header to stop DoS flooding traffic. How- 
ever, if an AS within the trusted accountable group be- 
comes compromised or malicious, it may fail to perform 
ingress filtering or set the evil-bit, rendering the design 
ineffective. IPA provides a similar form of accountabil- 
ity, but uses cryptography to establish accountability and 
is robust to malicious or compromised ASes. 

An early version of IPA [58] outlines its main design 
modules. This work provides essential design details, an 
IPA prototype, and a comprehensive evaluation regarding 
IPA’s performance, adoptability, and security properties. 


9 Conclusion 


Lack of accountability makes the Internet vulnerable to 
many attacks, including source address spoofing, DoS 
flooding, prefix hijacking, and route forgery attacks. This 
work presents IPA, a design that bootstraps accountabil- 
ity in today’s Internet with deployable and low-cost en- 
hancements. IPA uses the top-level DNSSEC infrastruc- 
ture to securely bind an IP prefix to an AS’s public key 
and distributes these secure bindings using the routing 
system itself to lower deployment costs. We show that 
IPA enables a suite of security solutions [38, 43, 45] that 
collectively can combat the aforementioned network- 
layer attacks. We have presented the detailed IPA design, 
evaluated its performance, and shown that it is gradu- 
ally deployable and provides stronger incentives for early 
adoption than previous proposals [34, 38, 46, 56, 57]. 


Acknowledgment 


We thank Jeff Chase and the NSDI reviewers for their 
useful comments, and David Andersen for shepherding 


the paper. This work is supported in part by NSF awards 
CNS-0845858, CNS-1040043, and CNS-1017858. 


References 


[1] APNIC DNSSEC Service. http://www.apnic.net/services/services-apnic-provides/registration-services/dnssec.

[2] ARIN DNSSEC Deployment Plan. https://www.arin.net/resources/dnssec/index.html.

[3] BIND. https://www.isc.org/software/bind.

[4] DNSSEC Keys. http://www.ripe.net/dnssec-keys/index.html.

[5] DNSSEC Policy and Practice Statement. http://www.ripe.net/rs/reverse/dnssec/dps.html.

[6] DNSSEC Trust Anchors From ARIN. https://www.arin.net/resources/dnssec/trust_anchors.html.

[7] in-addr.arpa Transition. http://in-addr-transition.icann.org.

[8] Internet AS-level Topology on March 1st, 2011. http://irl.cs.ucla.edu/topology.

[9] IPv6 Support in BIND 9. http://www.bind9.net/manual/bind/9.3.2/Bv9ARM.ch04.html.

[10] M Root DNS Server. http://m.root-servers.org.

[11] Quagga Routing Suite. http://www.quagga.net.

[12] RADb: Routing Assets Database. http://www.radb.net.

[13] RIS Routing Beacons. http://www.ripe.net/projects/ris/docs/beacon.html.

[14] SecSpider the DNSSEC Monitoring Project. http://secspider.cs.ucla.edu.

[15] CIDR Report. http://www.cidr-report.org, 2006.

[16] DNS Survey: October 2009. http://dns.measurement-factory.com/surveys/200910.html, 2009.

[17] Cisco Patches Bug That Crashed 1 Percent of Internet. http://www.reuters.com/article/idUS418825996320100831, 2010.

[18] DNSSEC Signatures in Reverse DNS Zones Now Enabled. http://www.apnic.net/publications/news/2010/dnssec-signatures, 2010.

[19] Root DNSSEC Status Update, 2010-07-16. http://www.root-dnssec.org/2010/07/16/status-update-2010-07-16, 2010.

[20] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen, D. Moon, and
S. Shenker. Accountable Internet Protocol (AIP). In ACM SIGCOMM, 2008.

[21] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose. DNS Security
Introduction and Requirements. RFC 4033, 2005.

[22] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose. Protocol
Modifications for the DNS Security Extensions. RFC 4035, 2005.

[23] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose. Resource
Records for the DNS Security Extensions. RFC 4034, 2005.

[24] J. Bennett and H. Zhang. Hierarchical Packet Fair Queueing Algorithms.
IEEE/ACM TON, 5(5), 1997.

[25] M. A. Brown. Pakistan Hijacks YouTube. http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml, 2008.

[26] H. Chan, D. Dash, A. Perrig, and H. Zhang. Modeling Adoptability of
Secure BGP Protocols. In ACM SIGCOMM, 2006.

[27] M. Crawford. Binary Labels in the Domain Name System. RFC 2673, 1999.

[28] Deterlab. http://www.deterlab.net.

[29] DNS Deployment Initiative. http://www.dnssec-deployment.org.

[30] H. Eidnes, G. de Groot, and P. Vixie. Classless IN-ADDR.ARPA Delegation.
RFC 2317, 1998.

[31] M. Feldman, C. Papadimitriou, J. Chuang, and I. Stoica. Free-riding and
Whitewashing in Peer-to-Peer Systems. IEEE JSAC, 24(5):1010-1019, 2006.

[32] A. Haeberlen, P. Kuznetsov, and P. Druschel. PeerReview: Practical
Accountability for Distributed Systems. In ACM Symposium on Operating
Systems Principles, 2007.

[33] M. Handley, E. Kohler, A. Ghosh, O. Hodson, and P. Radoslavov. Designing
Extensible IP Router Software. In USENIX/ACM NSDI, 2005.

[34] Y. Hu, A. Perrig, and M. Sirbu. SPV: Secure Path Vector Routing for
Securing BGP. In ACM SIGCOMM, 2004.

[35] Y.-C. Hu, D. McGrew, A. Perrig, B. Weis, and D. Wendlandt.
(R)Evolutionary Bootstrapping of a Global PKI for Securing BGP. In ACM
HotNets-V, 2006.

[36] G. Huston. Measuring IPv6 Deployment. http://www.internetac.org/wp-content/uploads/2010/02/apnic-v6-oecd1.pdf, 2009.

[37] S. Josefsson. Storing Certificates in the Domain Name System (DNS). RFC
4398, 2006.

[38] S. Kent, C. Lynn, and K. Seo. Secure Border Gateway Protocol (S-BGP).
IEEE JSAC, 2000.

[39] O. Kolkman and R. Gieben. DNSSEC Operational Practices. RFC 4641, 2006.

[40] M. E. Kounavis, X. Kang, K. Grewal, M. Eszenyi, S. Gueron, and
D. Durham. Encrypting the Internet. In ACM SIGCOMM, 2010.

[41] B. Lampson. Accountability and Freedom. http://research.microsoft.com/en-us/um/people/blampson/Slides/AccountabilityAndFreedomAbstract.htm, 2005.

[42] A. Li, X. Liu, and X. Yang. Dirty-Slate Accountable Internet Design.
Technical Report 2010-07 (available at http://www.cs.duke.edu/nds/papers/ipa-tr.pdf),
Duke University, 2010.

[43] X. Liu, A. Li, X. Yang, and D. Wetherall. Passport: Secure and Adoptable
Source Authentication. In USENIX/ACM NSDI, 2008.

[44] X. Liu, X. Yang, and Y. Lu. To Filter or to Authorize: Network-Layer DoS
Defense Against Multimillion-node Botnets. In ACM SIGCOMM, 2008.

[45] X. Liu, X. Yang, and Y. Xia. NetFence: Preventing Internet Denial of
Service from Inside Out. In ACM SIGCOMM, 2010.

[46] P. McDaniel, W. Aiello, K. Butler, and J. Ioannidis. Origin Authentication
in Interdomain Routing. Computer Networks, 50(16):2953-2980, 2006.

[47] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034, 1987.

[48] J. Nazario. Estonian DDoS Attacks - A Summary to Date. http://asert.arbornetworks.com/2007/05/estonian-ddos-attacks-a-summary-to-date, 2007.

[49] J. Nazario. Georgia DDoS Attacks - A Quick Summary of Observations. http://asert.arbornetworks.com/2008/08/georgia-ddos-attacks-a-quick-summary-of-observations, 2008.

[50] E. Osterweil, M. Ryan, D. Massey, and L. Zhang. Quantifying the
Operational Status of the DNSSEC Deployment. In IMC, 2008.

[51] Y. Rekhter, T. Li, and S. Hares. A Border Gateway Protocol 4 (BGP-4).
RFC 4271, 2006.

[52] P. Roberts. Massive Denial Of Service Attack Severs Myanmar From
Internet. http://threatpost.com/en_us/blogs/massive-denial-service-attack-severs-myanmar-internet-110310, 2010.

[53] RouteViews Project. http://www.routeviews.org.

[54] D. R. Simon, S. Agarwal, and D. A. Maltz. AS-based Accountability as a
Cost-effective DDoS Defense. In USENIX HotBots, 2007.

[55] Q. Vohra and E. Chen. BGP Support for Four-octet AS Number Space.
RFC 4893, 2007.

[56] T. Wan, E. Kranakis, and P. van Oorschot. Pretty Secure BGP (psBGP). In
NDSS, 2005.

[57] R. White. Securing BGP Through Secure Origin BGP. The Internet Protocol
Journal, 2003.

[58] X. Yang and X. Liu. Internet Protocol Made Accountable. In ACM
HotNets-VIII, 2009.

[59] X. Yang, D. Wetherall, and T. Anderson. A DoS-Limiting Network
Architecture. In ACM SIGCOMM, 2005.

[60] A. R. Yumerefendi and J. S. Chase. Strong Accountability for Network
Storage. ACM Transactions on Storage, 3(3), 2007.


NSDI '11: 8th USENIX Symposium on Networked Systems Design and Implementation


USENIX Association 




Privad: Practical Privacy in Online Advertising 


Saikat Guha, Bin Cheng, Paul Francis 
Microsoft Research India, and MPI-SWS 
saikat@microsoft.com, {bcheng,francis}@mpi-sws.org 


Abstract 


Online advertising is a major economic force in the In- 
ternet today, funding a wide variety of websites and ser- 
vices. Today’s deployments, however, erode privacy and 
degrade performance as browsers wait for ad networks 
to deliver ads. This paper presents Privad, an online ad- 
vertising system designed to be faster and more private 
than existing systems while filling the practical market 
needs of targeted advertising: ads shown in web pages; 
targeting based on keywords, demographics, and inter- 
ests; ranking based on auctions; view and click account- 
ing; and defense against click-fraud. Privad occupies a 
point in the design space that strikes a balance between 
privacy and practical considerations. This paper presents 
the design of Privad, and analyzes the pros and cons of 
various design decisions. It provides an informal anal- 
ysis of the privacy properties of Privad. Based on mi- 
crobenchmarks and traces from a production advertising 
platform, it shows that Privad scales to present-day needs 
while simultaneously improving users’ browsing experi- 
ence and lowering infrastructure costs for the ad network. 
Finally, it reports on our implementation of Privad and 
deployment of over two thousand clients. 


1 Introduction 


Online advertising is a key economic driver in the In- 
ternet economy, funding a wide variety of websites and 
services. Internet advertisers increasingly work to pro- 
vide more personalized advertising. Unfortunately, per- 
sonalized online advertising comes at the price of indi- 
vidual privacy [23]. Privacy advocates would like to put 
an end to advertising models that violate privacy, and in- 
deed have had some success with startups in the early 
stages of deployment [19]. On the other hand, they have 
had little success with the more entrenched ad brokers 
like Google and Yahoo! [11]. Arguably the reason why 
privacy advocates have failed here is that they offer no 
viable alternatives, and so the privacy solution they pro- 
pose is effectively to end on-line advertising. This paper 
presents a practical and substantially more private online 
advertising system that attempts to offer that alternative. 

To effect real change in the privacy of commercial ad- 
vertising systems, we require that our design goals for 
Privad include commercial viability. This in turn requires 
that Privad: 


1. is private enough that privacy advocacy groups¹ 
support it, 

2. targets ads well enough to produce better click- 
through rates (or conversion rates, etc.) than current 
systems, 

3. is as or less expensive to deploy than current systems, 
and 

4. fits within the current business framework for on- 
line advertising, and therefore more likely has a vi- 
able business model. In particular, the interaction 
between Privad and end users, advertisers, and pub- 
lishers, should not significantly change. 


These goals are contradictory in nature, and much of 
the design challenge is finding the right balance of pri- 
vacy and practicality. Although our arguments for scal- 
ability (goal 3) are strong and are buttressed by trace- 
based analysis, microbenchmarks, and deployment, we 
cannot definitively say that we have satisfied the other 
goals. While we hope to demonstrate better targeting 
through an experimental deployment (goal 2), this re- 
mains future work. The business model (goal 4) can ulti- 
mately only be demonstrated through a successful com- 
mercial deployment. While we have discussed our de- 
sign with a number of privacy advocates, and have got- 
ten favorable responses (goal 1), it is nevertheless hard 
to predict how they would react to a serious commercial 
deployment. 

In practice we believe that a commercial deployment 
of Privad would be a constant balancing act between the 
goals listed above: the broker would gauge the reaction 
of privacy advocates, and strengthen or weaken privacy 
in response. In the absence of this commercial deploy- 
ment and meaningful feedback from privacy advocates, 
our design assumes that privacy advocates will be hard 
to win over, and therefore favors privacy concerns over 
business concerns. In other words, our design attempts 
to produce the most private system possible within the 
constraint of achieving a merely feasible business model. 
In this paper, we nail down a design, present arguments 
as to why our practical goals are feasibly satisfied, and 


¹Private organizations like the Electronic Frontier Foundation 
(EFF) and the American Civil Liberties Union (ACLU), and government 
organizations like the Federal Trade Commission (FTC) and 
Europrise. 

describe the security and scalability properties that our 
design ultimately achieves. 

Privad preserves privacy by maintaining user profiles 
on the user’s computer instead of in the cloud. A small 
amount of information necessarily leaves the user’s com- 
puter: coarse-grained classes of ads a user is interested 
in, the ads the user has viewed or clicked on and the 
websites that carried the ads, and the ranking of ads for 
auctions. This information, however, is handled in such 
a way that no party can link it back to the individual user, 
or link together multiple pieces of information about the 
same user. An anonymizing proxy hides the user’s net- 
work address, while encryption prevents the proxy from 
learning any user information. A trusted open-source ref- 
erence monitor at the user’s computer prevents any Per- 
sonally Identifying Information (PII) other than network 
address from leaving the computer. 

By contrast, current advertising systems, such as 
Google and Yahoo!, are in a deep architectural sense not 
private: they gather information about users and store 
it within their data centers. These systems do not lend 
themselves to being audited by privacy advocates or reg- 
ulators. Users are essentially required to completely trust 
these systems to not do anything bad with the informa- 
tion. This trust can easily be violated, as for instance in 
a confirmed case where a Google employee spied on the 
accounts of four underage teens for months before the 
company was notified of the abuses [4]. 

Privad is considerably more private than current sys- 
tems (though admittedly this is a low bar; we believe 
that privacy advocates will hold us to a much higher stan- 
dard). Privad does not, for instance, require trust in any 
single organization. Additionally, Privad is designed to 
be auditable by third parties. Most of this auditing is automatic, 
through the use of a simple reference monitor 
in the client. While Privad makes it much harder for an 
organization to gather private user information, Privad’s 
privacy protocols are not bullet-proof (for instance with 
respect to collusion and covert channels), and so Privad 
allows the use of human-assisted or learning-based mon- 
itoring to detect misbehavior at the semantic level. 

The anonymizing proxy (called dealer) is a significant 
change to the current business framework (goal 4). The 
dealer is run by an untrusted third-party organization, 
e.g. datacenter operators. We discuss in later sections 
the justification behind the dealer model, auditing mech- 
anisms, and the feasibility of providing the service. We 
estimate the dealer’s operating cost at around a cent per 
user per year (Section 4). This can easily be met with 
funding from privacy advocates or levies on brokers. 

The other significant change is client software on the 
users’ computers. A key challenge, then, is incentivising 
deployment of this client software. Privad is not 
aimed at users who disable ads altogether. For users 

Figure 1: The Privad architecture 


that do view and occasionally click ads, deploying re- 
quires first that Privad not degrade user experience in 
any way. We can ensure this by only showing ads in the 
same ad boxes that are common today (unlike previous 
adware, which employed disruptive advertising). Sec- 
ond, especially early on there must be some positive in- 
centive for users to install it. This could be done through 
bundling other useful software, shopping discounts, or 
other incentives. Finally, it requires that privacy advo- 
cates endorse Privad. This at least prevents anti-virus 
software from actively removing the Privad client. Ide- 
ally, it even leads to privacy-conscious browser vendors 
(e.g. Firefox), anti-virus companies, or operating sys- 
tems installing it by default. 

The contributions of this paper are as follows: it 
presents a complete practical private advertising sys- 
tem. It describes the design of Privad, presents a fea- 
sibility study, and contributes a security analysis in- 
cluding both privacy and click-fraud aspects. It also 
gives a performance evaluation of our complete proof- 
of-concept implementation and pilot deployment of over 
two thousand users. Overall, Privad represents an argu- 
ment that highly-targeted practical online advertising and 
good user-privacy are not mutually exclusive. 


2 Privad Overview 


There are six components in Privad: client software, 
client reference monitor, publisher, advertiser, broker, 
and dealer (see Figure 1). Publisher, advertiser, and bro- 
ker all have analogs in today’s advertising model, and 
play the same basic business roles. Users visit publisher 
webpages. Advertisers wish their ads to be shown to 
users on those webpages. The broker (e.g. Google) 
brings together advertisers, publishers, and users. For 
each ad viewed or clicked, the advertiser pays the broker, 
and the broker pays the publisher. 


There are three new key components for privacy in Pri- 
vad. First, the task of profiling the user is done at the 
user’s computer rather than at the broker. This is done 
by client software running on the user’s computer. Sec- 
ond, all communication between the client and the bro- 
ker is proxied anonymously by a kind of proxy called the 
dealer. The dealer also coordinates with the broker (us- 
ing a protocol that protects user privacy) to identify and 
block clients participating in click-fraud. Finally, a thin 
trusted reference monitor between the client and the net- 
work ensures that the client conforms to the Privad proto- 
col and provides a hook for auditing the client software. 
Encryption is used to prevent the dealer from seeing the 
contents of messages that pass between the client and the 
broker. The dealer prevents the broker from learning the 
client’s identity or from linking separate messages from 
the same client. 


At a high level, the operation of Privad goes as fol- 
lows. The client software monitors user activity (for 
instance webpages seen by the user, personal informa- 
tion the user inputs into social networking sites, possibly 
even the contents of emails or chat sessions, and so on) 
and creates a user profile which contains a set of user at- 
tributes. These attributes consist of short-term and long- 
term interests and demographics. Interests include prod- 
ucts or services like sports.tennis.racket or outdoor.lawn- 
care. Demographics include things like gender, age, 
salary, and location. 


Advertisers submit ads to the broker, including the 
amount bid and the set of interests and demographics tar- 
geted by each ad. The client requests ads from the broker 
by anonymously subscribing to a broad interest category 
combined with a few broad non-sensitive demographics 
(gender, language, region). The broker transmits a set of 
ads matching that interest and demographics. These ads 
cover all other demographics and fine-grained locations 
within the region, and so are a superset of the ads that 
will ultimately be shown to the user. The client locally 
filters and caches these ads. If the user has multiple in- 
terests, there is a separate subscription for each interest, 
and privacy mechanisms prevent the broker from linking 
the separate subscriptions to the same user. 
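
The subscription and filtering steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not the Privad implementation: the type and function names are hypothetical, and age and city stand in for the "other demographics and fine-grained locations" that never leave the client.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Channel:
    """What the broker sees: one interest plus broad demographics."""
    interest: str
    gender: str
    language: str
    region: str


@dataclass(frozen=True)
class Ad:
    ad_id: str
    channel: Channel
    min_age: int = 0            # fine-grained targeting, applied client-side
    max_age: int = 200
    city: Optional[str] = None  # None means anywhere in the region


@dataclass(frozen=True)
class Profile:
    """Kept on the user's computer only."""
    interests: Tuple[str, ...]
    gender: str
    language: str
    region: str
    age: int
    city: str


def channels_for(profile: Profile) -> List[Channel]:
    """One separate, unlinkable subscription per interest."""
    return [Channel(i, profile.gender, profile.language, profile.region)
            for i in profile.interests]


def broker_serve(channel: Channel, inventory: List[Ad]) -> List[Ad]:
    """The broker returns every ad on the channel: a superset for any user."""
    return [ad for ad in inventory if ad.channel == channel]


def client_filter(profile: Profile, ads: List[Ad]) -> List[Ad]:
    """The client keeps only ads matching the full local profile."""
    return [ad for ad in ads
            if ad.min_age <= profile.age <= ad.max_age
            and ad.city in (None, profile.city)]
```

The broker learns only the channel tuple; the narrowing by age and city happens after the ads have reached the client.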


Ad auctions determine which ads are shown to the user 
and in what order. The ranking function, identical to the 
one used in industry today, uses, in addition to the bid 
information, both user and global modifiers. User modifiers 
are based on things like how well the targeting information 
matches the user, and the user’s past interest in 
similar ads. Global modifiers are based on the aggregate 
click-through-rate (CTR) observed for the ad, the quality 
of the advertiser webpage, etc. 
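
As a rough illustration of how such a ranking might combine its inputs, the sketch below multiplies the bid by the two modifiers. The multiplicative form is purely our assumption; the text above does not give the formula, and the names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    ad_id: str
    bid: float              # amount bid by the advertiser
    user_modifier: float    # e.g. how well the targeting matches this user
    global_modifier: float  # e.g. aggregate CTR, advertiser page quality


def score(c: Candidate) -> float:
    # Assumed combination rule: bid scaled by both modifiers.
    return c.bid * c.user_modifier * c.global_modifier


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order ads for display, highest score first."""
    return sorted(candidates, key=score, reverse=True)


ranked = rank([
    Candidate("a", bid=2.0, user_modifier=0.5, global_modifier=1.0),
    Candidate("b", bid=1.0, user_modifier=1.0, global_modifier=0.5),
    Candidate("c", bid=3.0, user_modifier=0.25, global_modifier=1.0),
])
```

With these numbers the highest bidder (c) does not win, because its user modifier is low; this is the effect the modifiers are meant to have.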


Figure 2: The Client framework. (Diagram labels: Client, an untrusted 
black box; Reference Monitor; Requests (clear text); Requests 
(encrypted); Browser Sandbox.) 


When the user browses a website that provides ad 
space, or runs an application like a game that includes 
ad space, the client selects an ad from the local cache 
and displays it in the ad space. A report of this view is 
anonymously transmitted to the broker via the dealer. If 
the user clicks on the ad, a report of this click is likewise 
anonymously transmitted to the broker. These reports 
identify the ad and the publisher on whose webpage 
or application the ad was shown. Privacy mechanisms 
prevent multiple reports from the same user from being 
linked together by the broker. The broker uses these re- 
ports to bill advertisers and pay publishers. 

Unscrupulous users or compromised clients may 
launch click-fraud attacks on publishers, advertisers, or 
brokers. Both the broker and dealer are involved in de- 
tecting and mitigating these attacks (Section 3.4). When 
the broker detects an attack, it indicates to the dealer 
which reports relate to the attack. The dealer then traces 
these back to the clients responsible, and suppresses fur- 
ther reports from attacking clients, mitigating the attack. 
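
The trace-back step can be pictured with the sketch below. The actual protocol is the subject of Section 3.4; here we simply assume that the dealer tags each forwarded report with a random identifier that the broker can later quote back. The class, method names, and identifier scheme are hypothetical.

```python
import secrets
from typing import Dict, Iterable, Optional, Set, Tuple


class Dealer:
    """Forwards opaque reports and remembers which client sent each one."""

    def __init__(self):
        self._client_of: Dict[str, str] = {}  # report id -> client address
        self._blocked: Set[str] = set()

    def forward(self, client_addr: str,
                encrypted_report: bytes) -> Optional[Tuple[str, bytes]]:
        """Return (report_id, report) for the broker, or None if suppressed."""
        if client_addr in self._blocked:
            return None
        report_id = secrets.token_hex(8)  # reveals nothing about the client
        self._client_of[report_id] = client_addr
        return report_id, encrypted_report

    def suppress(self, flagged_report_ids: Iterable[str]) -> None:
        """Block the clients behind the reports the broker flagged."""
        for report_id in flagged_report_ids:
            client = self._client_of.get(report_id)
            if client is not None:
                self._blocked.add(client)
```

In this sketch the broker never sees a client address and the dealer never sees report contents, yet together they can cut off an attacking client.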

Users, or privacy advocates operating on behalf of 
users, must be able to convince themselves that the client 
cannot undetectably leak private information. While hav- 
ing a trusted third-party write the client software might 
appear at first glance to be an option, it doesn’t solve the 
problem: a trusted client simply moves the trust users 
place on brokers today to the third party. At the same 
time, it requires brokers to make their trade-secret profiling 
algorithms known to the third party, and to parties 
auditing the client. Instead, Privad places a thin trusted 
reference monitor between the client and the network, giving 
users and privacy advocates a hook to detect privacy vi- 
olations (Section 3.5). It treats the client in a black-box 
manner (Figure 2), allowing the broker to use existing 
technological and legal frameworks for protecting trade- 
secret code. The reference monitor itself is simple, open 
source, and open to validation so its correctness can be 
verified, and can therefore be trusted by the user. 

Note that Figure 1 does not portray the interaction that 
takes place between client and advertiser after an ad is 
clicked. For the purpose of this paper, we assume that a 
click brings the client directly to the advertiser as is the 
case today. We realize that this is a problem, because the 
finer-grained targeting of Privad gives unscrupulous advertisers 
more information than they get today. The Privad 
architecture leaves open the possibility of privately 
proxying the post-click session between client and ad- 
vertiser, and even protecting the client from inadvertently 
releasing sensitive information. Because of space limita- 
tions, we do not further discuss this option, and only con- 
sider protecting the user from the broker and dealer. Pri- 
vad does not modify today’s relationship between client 
and publisher. 


3 Privad Details 


This section provides details on ad dissemination, ad 
auctions, view/click reporting, click-fraud defense and 
the reference monitor. It also puts forth some of the ra- 
tionale for our design decisions. These details represent 
a snapshot of our current thinking. While ad dissemi- 
nation, reporting, and reference monitor are quite stable, 
the click-fraud defense, and auctions may easily evolve 
as we do more analysis and testing. We present them 
here so as to present a complete argument for Privad’s 
viability. 


3.1 Ad Dissemination 


The most privacy-preserving way to disseminate ads 
would be for the broker to transmit all ads to all clients. 
In this way, the broker would learn nothing about the 
clients. In [13], we measured Google search ads and con- 
cluded that there are too many ads and too much ad churn 
for this kind of broadcast to be practical. We observed 
that the number of impressions for ads is highly skewed: 
a small fraction of ads (10%) garner a disproportionate 
fraction of impressions (80%). Furthermore, this 10% of 
ads tend to be more broadly targeted and therefore of in- 
terest to many users. It may therefore be cost effective 
to disseminate only this small fraction of ads to all users, 
for instance using a BitTorrent-like mechanism. For the 
remaining 90%, however, a different approach is needed. 
We therefore design a privacy-preserving pub-sub mech- 
anism between the broker and client to disseminate ads. 


The pub-sub protocol (Figure 3) consists of a client’s 
request to join a channel (defined below), followed by 
the broker serving a stream of ads to the client. 


Each channel is defined by a single interest attribute 
and limited non-sensitive broad demographic attributes, 
for instance wide geographic region, gender, and language.
The purpose of the additional demographics is to
help scale the pub-sub system: limiting an interest by re- 
gion or language greatly reduces the number of ads that 
need to be sent over a given channel while still main- 
taining a large number of users in that channel (in the 
k-anonymity sense). Channels are defined by the bro- 
ker. The complete set of channels is known to all clients, 
for instance by having dealers host a copy (signed by
the broker). A client joins a channel when its profile
attributes match those of the channel.

Figure 3: Message exchange for pub-sub ad dissemination.
E_x(M) represents the encryption of message M under key x.
B is the public key of the broker. C is a symmetric key
generated by the client for only this subscription.

The join request is encrypted with the broker’s public
key (B) and transmitted to the dealer. The request contains
the pub-sub channel (chan) and a per-subscription
symmetric key (C) generated by the client and used by
the broker to encrypt the stream of ads sent to the client.
For each subscription, the dealer generates a unique
(random) request ID (Rid). It stores a mapping between Rid
and the client, and appends the Rid to the message
forwarded to the broker. The broker attaches the Rid to the
ads it publishes, which the dealer uses to look up the
client to forward the ads to.
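
The dealer's part in this exchange can be sketched as follows. This is a minimal illustration under stated assumptions, not the Privad implementation: the class and method names (Dealer, forward_join, route_ad) are hypothetical, and the encrypted payloads are modeled as opaque byte strings that the dealer never inspects.

```python
import secrets


class Dealer:
    """Forwards opaque (encrypted) messages and hides client identity."""

    def __init__(self):
        # Rid -> client address; this mapping never leaves the dealer.
        self.rid_to_client = {}

    def forward_join(self, client_addr, encrypted_request):
        # The dealer cannot read encrypted_request. It only tags the
        # message with a fresh random request ID and remembers the sender.
        rid = secrets.token_hex(16)
        self.rid_to_client[rid] = client_addr
        return encrypted_request, rid  # what the broker receives

    def route_ad(self, rid, encrypted_ad):
        # The broker labels each published ad with the Rid; the dealer maps
        # it back to the client without learning the ad content.
        return self.rid_to_client[rid], encrypted_ad


dealer = Dealer()
blob, rid = dealer.forward_join("198.51.100.7", b"<E_B(chan, C)>")
dest, ad = dealer.route_ad(rid, b"<E_C(ad)>")
```

In this sketch the broker sees only the opaque request and the Rid, while the dealer sees only the client address and ciphertext, which mirrors the split of knowledge described above.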

The broker determines which ads should be sent and 
for how long they should be cached at the client. For 
instance, the broker stops sending ads for an advertiser 
when the advertiser nears its budget limit. Note that not
all ads transmitted are appropriate for the user, and so 
may not be displayed to the user. For instance, an ad 
may be targeted towards a married person, while the user 
is single. Because the subscription does not specify mari- 
tal status, the broker sends all ads independent of marital 
status or other targeting, and the client filters out those 
that do not match. Over time, the broker can estimate the 
number of ads that must be sent out for a particular ad- 
vertiser to generate a target number of views and clicks. 


3.2 Ad Auctions 


Auctions determine which ads are shown to the user and 
in what order. For the advertiser, the auction provides a 
fair marketplace where the advertiser can influence the 
frequency and position of its ads through its bids. The 
broker additionally wants to maximize revenue, primar- 
ily by maximizing click-through rates (CTR). This is be- 
cause most of today’s advertising systems charge adver- 
tisers for clicks, not views. The broker also wants to min- 
imize auction churn, generally by using a second-price 
auction [8]. A second-price auction is one whereby the 
bidder pays not the amount he bid, but the amount bid by 
the next lower bidder. This prevents the bidder from hav- 
ing to frequently change its bid in an attempt to probe for 
the bid value one unit higher than the next lower bidder. 
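
As a small illustration of the second-price rule, the sketch below handles a single ad slot. The function name and the bid values are made up for the example and do not come from any ad system.

```python
def second_price_winner(bids):
    """Return (winner, price) for a single-slot second-price auction.

    bids: dict mapping bidder name -> bid amount (at least two bidders).
    The highest bidder wins but pays the next lower bid.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    _, runner_up_bid = ranked[1]
    return winner, runner_up_bid


winner, price = second_price_winner({"A": 1.50, "B": 1.10, "C": 0.40})
# Raising A's bid does not change what A pays, so A has no reason to probe.
same_winner, same_price = second_price_winner({"A": 5.00, "B": 1.10, "C": 0.40})
```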

Compared to today’s brokers, which have full infor- 
mation about the system and can decide exactly which 
ads are shown where, in Privad both the client and the 
broker influence which ads are shown. This changes
many aspects of the auction: for instance when the auction
is run, over what set of ads, and the criteria by which
second price is decided. The design space for Privad auctions
is very large, and its complete exploration is a topic
of further study. Nevertheless we describe two
proof-of-concept auctions here.

Figure 4: Industry-standard GSP Auction. Client annotates ads (across all channels) with quality of match, or random number if
the ad doesn’t match the user. Dealer mixes annotations from multiple clients. Broker ranks ads by bid, global click-through rate,
advertiser quality, and match quality, and annotates the result with opaque bid information. Dealer slices auction result by client.
Client filters out non-matching ads. Client reports encrypted second-price bid on click.

A simple auction from this design space goes as fol- 
lows. The broker periodically runs the auction over the 
set of ads targeted to a given pub-sub interest channel, 
producing a ranked set of ads. The ranking is preserved 
when ads are sent to clients. Clients filter out non- 
matching ads, slightly modify the ranking according to 
the quality of the demographic match for each ad, and 
show ads to users based on the modified ranking. When 
the broker receives a click report, it uses its original rank- 
ing to select the second price. 

This auction is clearly different from Google’s GSP 
auction [8]. For instance, with GSP, the auction is run 
when the browser requests a set of ads, and the second 
price is based on the ad below the clicked ad on the ac- 
tual web page. We cannot necessarily say that our simple 
auction is worse than or better than GSP—this is a com- 
plex question and depends on, among other things, the 
evaluation criteria. As a demonstration of commercial 
viability, however, we now present a more complex auc- 
tion that is identical to the industry-standard GSP auction 
mechanism. 

In this second approach (Figure 4), the broker conducts
the auction in a separate exchange. First, ads are
sent to clients using pub-sub as originally described. The
broker attaches a unique instance ID (Iid) to each copy
of the ad published (not shown in figure). For each ad,
the client computes a coarse score (U), typically between
1 and 5, as follows: for ads that match the user, the score
reflects the quality of match, with 5 signifying the best
possible match. For ads that don’t match the user, the
score is a random number. To rank ads, the client sends
(Iid, U) tuples for all ads in the client’s database to the
dealer. The dealer aggregates and mixes tuples from
different clients before forwarding them to the broker. The
broker ranks all the ads in the message. The ranking is
based on both global and user modifiers (e.g. bids, CTR,
advertiser quality, and client score). Note that the ranked
result contains all ads from the same client in the correct
order, interspersed with ads for other clients (also in their
correct order). The broker returns this ranked list to the
dealer. The dealer uses the Iid to slice the list by client
and forwards each slice to the appropriate client. The
client discards the ads that do not match the user, and
stores the rest in ranked order.
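
The mix, rank, and slice steps can be sketched as below. This is an illustration only: the function names are hypothetical, and the scoring formula (bid times CTR times client score) is an assumed stand-in, since the text does not specify how the broker combines its modifiers.

```python
import random


def dealer_mix(per_client_tuples, rng):
    """Combine (Iid, U) tuples from all clients and shuffle them.

    Returns the mixed list for the broker and the dealer's private
    Iid -> client map used later for slicing.
    """
    owner = {}
    mixed = []
    for client, tuples in per_client_tuples.items():
        for iid, u in tuples:
            owner[iid] = client
            mixed.append((iid, u))
    rng.shuffle(mixed)
    return mixed, owner


def broker_rank(mixed, ad_info):
    """Rank all ads in one pass; the scoring formula is an assumption."""
    def score(item):
        iid, u = item
        bid, ctr = ad_info[iid]
        return bid * ctr * u
    return [iid for iid, _ in sorted(mixed, key=score, reverse=True)]


def dealer_slice(ranked, owner):
    """Split the global ranking by client, preserving relative order."""
    slices = {}
    for iid in ranked:
        slices.setdefault(owner[iid], []).append(iid)
    return slices


tuples = {"c1": [("i1", 5), ("i2", 2)], "c2": [("i3", 4), ("i4", 1)]}
ad_info = {"i1": (1.0, 0.02), "i2": (3.0, 0.01),
           "i3": (2.0, 0.03), "i4": (0.5, 0.01)}
mixed, owner = dealer_mix(tuples, random.Random(0))
slices = dealer_slice(broker_rank(mixed, ad_info), owner)
```

Because slicing preserves the broker's global order, each client receives its own ads already ranked, and the broker never learns which tuples came from the same client.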


To obtain the GSP second price, the broker encrypts 
the bid information with a symmetric key (4) known 
only to the broker and sends it along with the ad. When 
a set of ads are chosen to be shown to the user, the client 
pairs up the encrypted bid information for ad n + 1 with 
that of ad n. This encrypted bid pair is sent as part of 
the click report, which the broker decrypts to determine 
what the advertiser should be charged. 
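
The client-side pairing step might look like the following sketch. The function name and the blob values are hypothetical; the bid information is treated as opaque bytes because only the broker can decrypt it.

```python
def click_report_bids(shown, clicked_index):
    """Pair the clicked ad's opaque bid blob with that of the ad below it.

    shown: list of (ad_id, opaque_bid_blob) in displayed order.
    Returns (own_blob, next_blob); next_blob is None for the last ad.
    """
    _, own_blob = shown[clicked_index]
    if clicked_index + 1 < len(shown):
        next_blob = shown[clicked_index + 1][1]
    else:
        next_blob = None
    return own_blob, next_blob


shown = [("ad1", b"blob1"), ("ad2", b"blob2"), ("ad3", b"blob3")]
pair = click_report_bids(shown, 0)
last = click_report_bids(shown, 2)
```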


3.3 View/Click Reporting


Ad views and clicks, as well as other ad-initiated user
activity (purchase, registration, etc.), need to be reported
to the broker. The protocol for reporting ad events
(Figure 5) is straightforward. The report containing the ad
ID (Aid), publisher ID (Pid), and type of event (view,
click, etc.) is encrypted with the broker’s public key and
sent through the dealer to the broker. The dealer attaches
a unique (random) request ID (Rid) and stores a mapping
between the request ID and the client, which it uses
later to trace suspected click-fraud reports in a
privacy-preserving manner.

Figure 5: Message exchange for view/click reporting and
blocking click-fraud. B is the public key of the broker. Aid
identifies the ad. Pid identifies the publisher website or
application where the ad was shown. For second-price auctions, the
opaque auction result is included. Rid uniquely identifies the
report at the dealer.


3.4 Click-Fraud Defense 


Click-fraud consists of users or bots clicking on ads for 
the purpose of attacking one or more parts of the system. 
It may be used to drive up a given advertiser’s costs, or 
to drive up the revenue of a publisher. It can also be used 
to drive up the click-through rate of an advertiser so that
that advertiser is more likely to win auctions. 

Generally speaking, privacy makes click-fraud more 
challenging because clients are hidden from the bro- 
ker. Privad addresses this challenge through an explicit 
privacy-preserving protocol between broker and dealer. 
Both the broker and dealer participate in detecting and 
blocking click-fraud; the dealer by measuring view and 
click volumes from clients, the broker by looking at over- 
all click behavior for advertisers and publishers. 

Blocking a fraudulent client once an attack is detected 
is straightforward. When a publisher or advertiser is un- 
der attack, the broker tells the dealer which report IDs are 
suspected as being involved in click-fraud. The dealer 
traces the report ID back to the client, and if the client 
is implicated more than some set threshold, subsequent 
reports from that client are blocked. 
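
The blocking step can be sketched as below, assuming the dealer keeps the Rid-to-client mapping from Section 3.3. The class name, method names, and threshold value are hypothetical choices for the example.

```python
from collections import Counter


class FraudFilter:
    """Dealer-side blocking of clients implicated in click-fraud."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.rid_to_client = {}     # report ID -> client, known only to dealer
        self.implicated = Counter() # how often each client was implicated
        self.blocked = set()

    def accept_report(self, client, rid):
        """Return True if the report should be forwarded to the broker."""
        if client in self.blocked:
            return False
        self.rid_to_client[rid] = client
        return True

    def flag_reports(self, suspicious_rids):
        """Broker names suspect report IDs; dealer maps them to clients."""
        for rid in suspicious_rids:
            client = self.rid_to_client.get(rid)
            if client is None:
                continue
            self.implicated[client] += 1
            if self.implicated[client] > self.threshold:
                self.blocked.add(client)


f = FraudFilter(threshold=2)
for rid in ("r1", "r2", "r3"):
    f.accept_report("bot", rid)
f.accept_report("user", "r4")
f.flag_reports(["r1", "r2", "r3", "r4"])
bot_ok = f.accept_report("bot", "r5")
user_ok = f.accept_report("user", "r6")
```

The broker only ever names report IDs, so it learns nothing about which client was blocked.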

As with today’s ad networks, there is no silver bullet 
for detecting click-fraud. And like ad networks today, 
the approach we take is defense in depth — a number of 
overlapping detection mechanisms (described below) op- 
erate in parallel; each detection mechanism can be fooled 
with some effort; but together, they raise the bar. 

Per-User Thresholds. The dealer tracks the number 
of subscriptions, and the rates of view/click reports for 
each client (identified by their IP address). Clients that 
exceed thresholds set by the broker are flagged as suspi- 
cious. The broker may provide a list of NATed networks 
or public proxies so higher thresholds may apply to them. 

Blacklist. Dealers flag clients on public blacklists, 
such as lists maintained by anti-virus vendors or net- 
work telescope operators that track IP addresses partici- 
pating in a botnet. Dealers additionally share a blacklist 
of clients blocked at other dealers. 

Honeyfarms. The broker operates honeyfarms that
are vulnerable to botnet infection. Once a honeyfarm
machine is infected, the broker can directly track which
publishers or advertisers are under attack. When a report
matching the attack signature is received, the broker asks
the dealer to flag the originating client as suspicious.

Historical Statistics. The dealer maintains a number of
per-client statistics, and the broker a number of
per-publisher and per-advertiser statistics, including volume
of view reports and click-through rates. Any sudden increase
in these statistics causes the clients generating the reports
to be flagged as suspicious.

Premium Clicks. Based on the insight behind [21], a 
user’s purchase activity is used as an indication of hon- 
est behavior. Clicks from honest users command higher 
revenues. The broker informs the dealer which reports 
are purchases. The dealer flags the origin client as “pre- 
mium” for some period of time, and attaches a single 
“premium bit” to subsequent reports from these clients. 

Bait Ads. An approach we are actively investigating 
is something we term “bait ads” (similar to [14]), which 
can loosely be described as a cross between CAPTCHAs 
and the invisible-link approach to robot detection [27]. 
Basically, bait ads contain the targeting information of 
one ad, but the content (graphics, flash animation) of a 
completely different ad. For instance, a bait ad may ad- 
vertise “dog collars” to “cat lovers”. The broker expects 
a very small number of such ads to be clicked by humans. 
A bot clicking on ads, however, would unwittingly trig- 
ger the bait. It is hard for a bot to detect bait, which 
for image ads amounts to solving semantic CAPTCHAs 
(e.g. [9]). Bait ads are published by the broker just like 
normal ads. When a click for a bait ad is reported, the 
broker informs the dealer, which flags the client as po- 
tentially suspicious. 

These mechanisms operate in concert as follows: per- 
user thresholds force the attacker to use a botnet. Hon- 
eyfarms help discover botnets, and blacklists limit the 
amount of time individual bots are of use to the attacker. 
Historical statistics block high-intensity attacks, instead 
forcing the attacker to gradually mount the attack, which 
buys additional time for honeyfarms and blacklists to 
kick in before significant financial damage is caused. At 
the same time, bait ads disseminated proactively can de- 
tect low volume attacks due to the strong signal gener- 
ated by a relatively small number of clicks, while dis- 
seminated reactively, bait ads can reduce false positives. 
And finally, premium clicks, by forcing the attacker to
spend money to acquire and maintain “premium” status 
for each bot, apply significant economic pressure, which 
is magnified by bots being blacklisted. 

Overall these mechanisms have the effect of more-or- 
less putting Privad back on an even footing with current 
ad networks as far as click-fraud is concerned. 


3.5 Reference Monitor 


The reference monitor has six functions geared towards 
making it difficult for the black-box client to leak private
information. We model the reference monitor on
Google’s Native Client (NaCl) sandbox [34] that allows 
running untrusted native code within a browser. As with 
NaCl, the sandbox presents a highly narrow and hard- 
ened API to untrusted code, and is itself open to valida- 
tion by security experts and privacy advocates. 

The reference monitor is hardened in at least the following
five ways. First, the reference monitor validates
that all messages in and out of the client follow Privad 
protocols. For this, the client is operated in a sandbox 
such that all network communication must go through 
the reference monitor in the clear (Figure 2). Second, 
it is the monitor that encrypts outbound messages from 
the client (and decrypts inbound messages). Third, the 
monitor is the source of all randomness in messages (e.g. 
session keys, randomized padding for encryption etc.). 
Fourth, the monitor may additionally provide cover traf- 
fic or introduce noise to protect user privacy in certain 
Privad operations. Fifth, the monitor arbitrarily delays 
messages or adds jitter to disrupt certain timing attacks. 

Technological means for disrupting covert channels are,
of course, not enough since the client may attempt to leak 
information through semantic means. For instance, the 
client might send lima-beans when it really means no- 
health-insurance. The sixth and final function of the ref- 
erence monitor is therefore to provide an auditing hook, 
which can be used for instance to interpose a human-in- 
the-loop. Interested users may occasionally inspect mes- 
sages for accuracy, and/or privacy advocates may set up 
honeyfarm clients, train them with specific profiles, and 
monitor them for inconsistent behavior using automated 
techniques presented in [12]. 


3.6 User Profiling 


Even though the client is ultimately in charge of pro- 
filing the user, it can nevertheless leverage existing 
cloud-based crawlers and profilers through a privacy- 
preserving query mechanism. At a high level the query 
protocol is similar to the pub-sub protocol (Figure 3) op- 
erating as a single request-response pair; the request con- 
tains the website URL and the response contains profile 
attributes. Beyond this, the client can locally scrape and 
classify pages, incorporate social feedback, or even al- 
low publisher websites to explicitly influence the profile. 
Overall, the user profiling options in Privad add to existing
cloud-based algorithms while preserving privacy,
and therefore have the potential to target ads better than
existing systems.


4 Feasibility 


To validate the basic feasibility of Privad, we estimate 
worst-case network and storage overhead based on a 
trace of ads delivered by Microsoft’s advertising plat- 
form (processing overhead is measured in Section 6). 


Network and storage overhead at the client is due pri- 
marily to pub-sub ad dissemination. We use a trace 
of Bing search ads to determine an expected number 
of channels per client and ads per channel. We make 
the pessimistic assumption that all ads associated with a 
channel are transmitted to all subscriptions for that chan- 
nel. We expect to be far more efficient than this in prac- 
tice, since we can design our pub-sub service so that 
clients receive only fractionally more ads than necessary 
to fill their ad boxes (subject to k-anonymity and advertiser
budget constraints). Summarizing our results, assuming
compression and a 1MB local cache, we estimate
the client will download less than 100KB per day on average
(worst case: 20MB cache, 1.25MB daily download:
less than a typical MP3 song). Even adjusting for the 
fact that our trace represents a good fraction, but a frac- 
tion nevertheless, of the search advertising market, and 
doesn’t include contextual advertising, this load poses 
little concern. 


We arrive at these estimates as follows: The Bing trace
we used (for over 2M users in the USA sampled on Sep.
1, 2010) classifies users and ads into 128 interest categories.
On average, each user is mapped to 2 interest categories
on a given day (9 categories in the 99th percentile
case). Using 2-4 coarse-grained geographic regions per
state, we obtain several tens of thousands of distinct
interest-region-gender Privad channels. Remapping Bing ads to
these channels results, on average, in slightly less than
2K ads for each channel (10K in the 99th percentile);
note that an ad may be mapped to multiple channels. Each
ad is roughly 250 bytes of text including the URL. This
results in an average unoptimized daily download size
of around 1MB (and less than 25MB in the worst case).
Compressing ad content (in bulk) reduces download size
by a factor of 10.
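
The arithmetic behind these figures can be reproduced with a few lines. This is a back-of-the-envelope check that uses the rounded numbers quoted above (2K is taken as an upper bound for "slightly less than 2K"), with decimal megabytes; the constant and function names are ours.

```python
AD_BYTES = 250      # approximate size of one text ad, from the trace
COMPRESSION = 10    # bulk compression factor reported above


def daily_download_bytes(channels, ads_per_channel):
    """Unoptimized daily download if every ad on every channel is sent."""
    return channels * ads_per_channel * AD_BYTES


avg_raw = daily_download_bytes(2, 2_000)      # 1,000,000 bytes, about 1MB
worst_raw = daily_download_bytes(9, 10_000)   # 22,500,000 bytes, under 25MB
avg_compressed = avg_raw / COMPRESSION        # 100,000 bytes, about 100KB
worst_compressed = worst_raw / COMPRESSION    # 2,250,000 bytes
```

The worst case here combines the 99th-percentile values for both channels per user and ads per channel, so it is deliberately pessimistic.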


Of these, only the subset matching the user’s other de- 
mographic attributes need to be stored in the client’s lo- 
cal cache. Using the Bing trace’s age-group classification 
alone, we get a factor of 5 reduction in storage. Occupa- 
tion, education, marital-status etc. may further reduce 
storage requirements but we lack data to estimate these. 
Cached ad data can then be used to further reduce client 
network traffic. This requires a slight modification to the 
pub-sub protocol to periodically transfer a bitmap of
active/inactive ads on the channel. Based on two weeks of
trace data, we find that 54% of ads on a channel were
seen the previous day (and around 70% within the previous
4 days; there is little added benefit for caching
beyond 4 days). Thus with a warmed-up 1MB cache,
the client needs to download on average 100KB (1.25MB
worst case) of compressed ad content plus a few tens of
kilobytes of periodic bitmap data per day. Privad does
not change the number of ads viewed by the user; based
on the Bing trace we estimate the client’s upload traffic
will be less than 20KB per day on average.

Consequently, we estimate the broker will send around
100KB and receive around 20KB per client per day, while
the dealer acting as a proxy will send (and receive)
around 120KB per client per day. While broker network
overhead is higher than today, the Privad broker trades off
network for lower processing overhead. There is, however,
no simple comparison of Privad broker processing
overhead with that of existing systems. Today’s systems
are synchronous: they request a small number of ads
frequently, and ad selection plus auction plus ad delivery
must occur in milliseconds. Privad is asynchronous: a
large number of ads are requested infrequently, and these
do not have to be delivered immediately (overhead
quantified in Section 6). Thus comparing overall broker costs
depends, among other factors, on the reduction in broker
processing overhead and corresponding reduction in
datacenter provisioning costs, versus bandwidth costs. As
for the dealer, the network overhead works out to less
than 88MB per user per year. Assuming the dealer leases
datacenter resources at market prices, this amounts to
less than $0.01 per user per year (based on current
Amazon EC2 pricing [2]).
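
The dealer figure can be checked with the short calculation below, reading "send (and receive) around 120KB" as 120KB in each direction. The per-gigabyte price is an assumed illustrative value and is not taken from the text or from any price list.

```python
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000

per_day = 2 * 120 * KB        # 120KB sent plus 120KB received per client
per_year = per_day * 365      # bytes per client per year
per_year_mb = per_year / MB   # 87.6, i.e. less than 88MB

PRICE_PER_GB = 0.10           # ASSUMED transfer price in USD, illustrative only
cost_per_user_year = per_year / GB * PRICE_PER_GB
```

Under that assumed price the yearly cost per user stays below one cent, in line with the estimate above.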


5 Implementation and Pilot Deployment 


We have implemented the full Privad system and de- 
ployed it on a small scale. The system comprises a 
client implemented as a 210KB addon for the Firefox 
web browser, a dealer, and a broker. Out of the 11K to- 
tal lines of code, the dealer consists of only 700 lines — 
well within limits of what can be manually audited. 

We have deployed Privad with a small group of users 
comprised primarily of 2083 volunteers² we recruited using
Amazon’s Mechanical Turk service [1]. The primary
purpose of the deployment is to convince ourselves that 
Privad represents a complete system. To this end the de- 
ployment exercises all aspects of Privad including user 
profiling (by scraping the user’s Facebook profile and 
Google Ad Preferences), pub-sub ad dissemination, GSP 
auctions, view/click reporting, and basic click-fraud de- 
fense. For test ad data we scrape and re-publish Google 
ads through our system; since we lack targeting informa- 
tion for these ads, we target randomly. The system has 
been in continuous operation since Jan 1, 2010, with over 
271K ads viewed and 238 ads clicked as of Jan 6, 2011. 

The primary implementation challenge is the effort required
to scrape webpages for profiling purposes. Facebook’s
and Google’s layouts changed on multiple occasions
during our deployment, which required us to update
the client code (using the addon’s autoupdate mechanism).
We are presently working on a higher-level language
(and interpreter) for scraping webpages that will
allow us to react more quickly to website changes.

²Users were offered an average one-time reward of $0.40 (for the
1 minute it took on average to install the addon) with mechanisms in
place to prevent cheating. While users were required to leave the addon
installed for at least a week to get paid, most users either forgot about
it or chose to leave it installed for longer. As of Jan 6, 2011, 429 users
still have the addon installed.


6 Experimental Evaluation 


We use microbenchmarks to evaluate our system at scale. 

Broker: We first benchmark the performance of subscribe
and report messages at the broker since they involve
public-key operations. Without optimizations, as
expected, performance is bottlenecked by RSA decryptions.
While crypto operations could be offloaded
to hardware [18], since the broker is in any event
untrusted, we additionally have the option of offloading to
idle (untrusted) clients in the system (without impacting
privacy guarantees). With this optimization, the broker
need only perform symmetric-key (AES) and hashing
(SHA1) operations, which can be done at line speed using
dedicated hardware [22]. Our software-based implementation
achieved a throughput of 6K subscribe and
report requests per second (on a single core of a 3GHz
workstation), and can publish 8.5K ads per second and
perform around 30K auctions per second. We note that
request throughput in our broker is in the same ballpark
as production systems today (based on the traces mentioned
earlier), although this is somewhat of an
apples-to-oranges comparison since brokers in Privad are much
simpler.

In all cases the measured performance did not depend 
on the number of subscriptions or unique ads since all 
lookups at the broker are O(1); all runtime state (sub- 
scriptions, ads) is cached in memory and backed by per- 
sistent storage. The broker is designed with no shared 
state so it can trivially scale out to multiple cores. 

Dealer: Our dealer can forward 15K requests per sec- 
ond (on the same hardware) in both directions, which is 
sufficient for handling nearly 200K online clients (based 
on request rates from our deployment). The bottleneck is 
due to client-side polling which arises from implement- 
ing Privad’s asynchronous protocols on top of a request- 
response based transport (HTTP). With the emerging 
WebSockets standard [16], we believe we can eliminate 
this polling and support well over a million clients per 
dealer core. 

Client: Finally we focus on how Privad improves
a user’s web browsing experience by eliminating network
round-trips in the critical path of rendering webpages.
Figure 6 compares Privad performance to existing
ad networks. The figure compares the delay added
for both populating ad boxes (on the 20 most popular
sites as ranked by Alexa), and for completing the redirect
to the advertiser webpage after a click. For Privad,
we measured the time taken to populate ad boxes as we
scale the number of (relevant) ads cached in the client
database. As mentioned, we estimate the typical number
of cached ads to be between 10K (average) and 100K
(worst case); we benchmark with a factor of ten margin.
As one might expect, our client implementation outperforms
existing ad networks since displaying ads requires
only local disk access. Our client can populate ad boxes,
based on keywords or website context, in 31 ms. In existing
networks, we found the delay was dominated by the
ad selection process; downloading the actual ad content
(e.g. 30kB flash file) took less than 2ms. Doubleclick,
which to our knowledge does not perform demographic
or context sensitive advertising, took 129ms in the median
case, and Google, which does perform context sensitive
advertising, took 670ms. With regard to reporting
clicks, existing ad networks must perform a synchronous
redirect through the ad network, which consumes several
RTTs. Since Privad reports clicks asynchronously (when
the browser is idle), the redirect is unnecessary, thus
allowing much faster advertiser page-loads.

Figure 6: Privad eliminates network RTTs for showing ads,
and reporting clicks. Whiskers for Privad show performance as
the number of (relevant) ads in the client’s database scales to
1 million. Whiskers and boxes for existing ad networks show
minimum and maximum latencies, and quartiles.


Our client scrapes webpages, pre-fetches ads, con- 
ducts auctions, and sends reports in the background. 
Messages that require public-key encryptions take between
68ms (on a workstation) and 160ms (on a netbook)
to construct, but since they are performed when
the browser is idle, they are imperceptible to the user. 
The client uses negligible memory since ads are stored 
on disk; there is no appreciable change in the browser’s 
memory footprint whether the client is enabled or disabled.
During our 12-month deployment, we have not
received any negative feedback, performance related or
otherwise, from users³.

³Or, for that matter, positive feedback.


7 Privacy Analysis 


Broadly speaking, Privad uses technological means to 
protect user privacy. Privad provides privacy through 
unlinkability [28] (described below), and uses the dealer 
mechanism to ensure this. It is worth considering briefly 
alternative design points that we opted against. 

Considering it is believed to be impossible to design 
systems that are secure against covert channels and collusion
[17,26], neither we nor privacy advocates expect
bulletproof privacy. Privacy advocates instead have the 
much softer requirement that “individuals [be] able to 
control their personal information”, and if privacy is vio- 
lated, the ability to “hold accountable organizations [re- 
sponsible]” [5]. Privad trivially satisfies the first require- 
ment by storing all personal information on the user’s 
computer and assuring unlinkability. In the absence of 
covert channels or collusion, this prevents any organi- 
zation from learning about users, thereby preventing pri- 
vacy violations in the first place. In the presence of covert 
channels or collusion, the organization’s willing and ex- 
plicit circumvention of technological privacy safeguards 
strongly implies malicious intent (in the legal sense) for
which it can be held accountable.

As a result, the oversight task for privacy advocates is 
reduced from detecting any kind of privacy violation, in- 
cluding those purely internal to a broker, to detecting col- 
lusion and the use of covert channels. As we discuss be- 
low, Privad incorporates existing (and future) techniques 
to disrupt or detect covert channels through the reference 
monitor mechanism and careful protocol design. Detect- 
ing collusion is easier with the dealer mechanism as com- 
pared to, say, a mixnet like TOR [6]. Not only does TOR 
not meet business needs by giving up any visibility into 
click fraud, TOR’s threat model is a poor match for Pri- 
vad since a single entry node colluding with the broker 
can compromise the anonymity of all users connecting 
through that node [3]. In contrast to mixnet nodes, a 
dealer organization (e.g. datacenter operators) can be 
contractually bound, and its non-collusionary involve- 
ment be monitored by privacy advocates. This model is 
in use today and is approved for instance by the European 
privacy certification organization Europrise [10]. 

Given that Privad relies to an extent on accountabil- 
ity, one might ask why a purely regulatory solution 
doesn’t suffice. There are two problems. First, en- 
trenched players like Google have strong incentives, lob- 
bying power, and the capital needed to maintain the sta- 
tus quo. Indeed many parallels can be drawn to the 
network-neutrality battle where powerful ISPs success- 
fully resisted new regulations threatening their business 
model [33]. Second, even if regulations were passed, en- 
forcement would require third-party auditing of all bro- 
ker operations, which is impractical due to the complex- 
ity and scale of these systems. Market forces, such as 
competition from a startup offering better ROI to advertisers
through deeper personalization (with backing from
privacy advocates), can arguably effect change more eas- 
ily. 

In the remainder of this section we first informally define
what we mean by user privacy and state our trust assumptions.
We then address the technical measures pertaining
to covert channels. Finally, we consider a series of
attacks on the system, the defense against each attack, and
the extent to which the defense truly addresses
the attack.


7.1 Defining Privacy 


Our privacy goals are based on Pfitzmann and Köhntopp’s
definition of anonymity [28], which is unlinkability
of an item of interest (IOI) and some logical user identifier.
Privad has three types of IOI: IP address, interest
attributes, and demographic attributes. Pfitzmann and
Köhntopp consider anonymity in terms of an anonymity
set, which is the set of users that share the given item of
interest — the larger this set, the “better” the anonymity. 
Personally Identifiable Information (PII) is information 
for which the anonymity set comprises a single (or a very 
small number of) elements; e.g., the IP address is PII. Ex- 
amples of non-PII anonymity sets in Privad include: the 
set of users that join a pub-sub channel, the set of users 
that visit a given publisher, and the set of users that view 
or click a given ad (i.e. probably share some or all of the 
ad’s attributes). 
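
As a rough illustration of the anonymity-set notion above, the following sketch groups users by item of interest and treats an item as PII when its set is very small. The user records, the attribute strings, and the `max_pii_set_size` cutoff are hypothetical and are not taken from Privad.

```python
from collections import defaultdict


def anonymity_sets(users):
    """Map each item of interest (IOI) to the set of user ids that share it."""
    sets = defaultdict(set)
    for user_id, items in users.items():
        for item in items:
            sets[item].add(user_id)
    return dict(sets)


def is_pii(item, sets, max_pii_set_size=1):
    """Treat an IOI as PII when its anonymity set is at most the cutoff.

    The cutoff is an illustrative assumption; the text only says "a single
    (or a very small number of) elements".
    """
    return len(sets[item]) <= max_pii_set_size


# Hypothetical population: each user has one IP address and some attributes.
users = {
    "u1": {"ip:10.0.0.1", "music", "sports.skiing.cross-country"},
    "u2": {"ip:10.0.0.2", "music"},
    "u3": {"ip:10.0.0.3", "music", "sports.skiing.cross-country"},
}
sets = anonymity_sets(users)
music_set_size = len(sets["music"])                       # all three users
skiing_set_size = len(sets["sports.skiing.cross-country"])  # two users
```

A broad attribute such as music yields a larger anonymity set than a narrow one such as sports.skiing.cross-country, while each IP address maps to a single user and is therefore PII under this cutoff.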

In our definition of privacy we draw a distinction be- 
tween IOI that contain PII and IOI that do not, as follows: 


P1) Profile Anonymity: No single player can link any
PII for a user with any attribute in the user’s profile.

P2) Profile Unlinkability: No single player can link to- 
gether more than a threshold number of (non-PII) 
profile attributes for the same user, which would 
otherwise allow them to, over time, construct a 
unique profile that could be deanonymized using ex- 
ternal databases. 


Existing ad networks, of course, satisfy neither Profile 
Anonymity nor Profile Unlinkability. 

Note that for Profile Unlinkability we use “number of 
profile attributes” rather than the size of the anonymity 
set even though the former doesn’t per se map directly 
onto the latter. Different attributes imply different sizes 
of anonymity sets (e.g., music vs. sports.skiing.cross- 
country). Ideally, Privad would dynamically guarantee a 
minimum anonymity set size at runtime, but this is not 
possible because any such approach is easily attacked 
with Sybils [7], e.g. a botnet of clients masquerading 
as members of that set. It is possible, however, to esti- 
mate offline the rough expected anonymity set size for an 
attribute with outside semantic knowledge. 

The approach towards privacy in Privad is then as follows:
1) Offline semantic analysis by privacy advocates
establishes per-message thresholds for Profile Unlinkability;
this is enforced at runtime by the monitor, as we
discuss later in Attack A9. 2) Mechanisms in Privad ensure
that multiple messages from the same client cannot be
linked together, and therefore the system as a whole cannot
violate Profile Unlinkability. 3) Since the dealer
is the only party that learns PII (IP address) and nothing
else about the user, Profile Anonymity is trivially satisfied.


7.2 Trust Assumptions 


The user trusts only the reference monitor; the client soft- 
ware, dealer and broker are all untrusted. Privacy advo- 
cates are expected to play a watchdog role by validating 
the reference monitor, monitoring dealer operation, and 
running honeyfarms to detect covert channels. The bro- 
ker does not trust clients, dealers, or reference monitors. 
Attack A4 below discusses malicious dealers including 
those that may engage in click fraud. Privad does not 
modify any interactions users or brokers have with pub- 
lishers or advertisers. The advertiser and publisher, like 
today, can see the user’s browsing behavior on their own 
site, and trust the broker to perform accurate billing. 


7.3 Covert Channels 


A malicious broker may distribute a malicious client that 
attempts to leak data using covert channels. The band- 
width of covert channels is reduced by bounding non- 
determinism in messages. Note first of all that the covert 
channel must come from Privad application message 
fields, not encapsulating protocol fields such as those in 
the crypto messages. This is because it is the reference 
monitor that takes care of crypto and message delivery 
functions. In addition, it is also the monitor that gener- 
ates the one-time shared keys (for subscriptions) which 
otherwise represent the best covert channel opportunity. 
Note next that the values of most message fields are 
driven by user behavior (outside client-control) and are 
subject to audit by privacy advocates or users. This in- 
cludes the channel ID in subscriptions, and the type, pub- 
lisher ID, and ad ID in reports, which together compose 
all remaining bits in subscribe and report messages. The 
next best opportunity for a covert channel would come 
from the user score in the GSP auction message (Fig- 
ure 4). That is because this is the only client-controlled 
message field, albeit only 2 or 3 bits in size since the 
user score need only be in a small range. This bounds 
the information that can be leaked by a single message. 
The Privad protocol and reference monitor make it 
hard to construct a covert channel across multiple mes- 
sages. Since messages from the same source cannot, by 
design, be linked based on content, the attacker must use 
some time-based watermarking technique (e.g., [32]).
The reference monitor adds arbitrary delay or jitter to 
messages to disrupt such attempts. For this reason, all 
Privad protocols are designed to be asynchronous and use 
soft-state without any acknowledgments. 
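
The jitter idea can be sketched as a small delay queue inside the monitor. This is only a minimal sketch: the class name `JitterQueue`, the uniform delay distribution, and the `max_delay_s` bound are assumptions made for illustration, since the text says only that the monitor adds arbitrary delay or jitter.

```python
import heapq
import random


class JitterQueue:
    """Hold outgoing messages for a random delay before releasing them."""

    def __init__(self, max_delay_s, rng=None):
        self.max_delay_s = max_delay_s
        self.rng = rng or random.Random()
        self._heap = []      # entries are (release_time, sequence_no, message)
        self._counter = 0    # tie-breaker so messages are never compared

    def submit(self, now_s, message):
        """Schedule a message for release at now + a uniform random delay."""
        release_at = now_s + self.rng.uniform(0.0, self.max_delay_s)
        heapq.heappush(self._heap, (release_at, self._counter, message))
        self._counter += 1

    def release_due(self, now_s):
        """Return every message whose release time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now_s:
            due.append(heapq.heappop(self._heap)[2])
        return due


# A burst of three messages submitted together leaves in a randomized order
# and at randomized times, which weakens timing-based watermarks.
queue = JitterQueue(max_delay_s=5.0, rng=random.Random(0))
for msg in ("report-a", "report-b", "report-c"):
    queue.submit(now_s=0.0, message=msg)
released = queue.release_due(now_s=5.0)
```

Because release order depends only on the random delay, a design like this fits protocols that are asynchronous and need no acknowledgments, as the paragraph above requires.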


A computer system cannot completely close all covert 
channels, but by at least making it possible for privacy- 
advocates to detect them, and by establishing malicious 
intent by requiring attackers to circumvent multiple tech- 
nical hurdles, Privad significantly increases the risk of 
being caught and thus decreases the utility of covert 
channels. This is in contrast to today where third-parties 
can neither detect privacy-violations, nor establish intent 
when violations are revealed [29]. 


7.4 Attacks and Defenses 


This section outlines a set of key attacks on user privacy. 
Space constraints prevent us from discussing in detail at- 
tacks on advertiser and broker privacy. We do however 
briefly note the following. Broker privacy, in the form 
of trade secrets for profiling mechanisms, is maintained 
because client software is a black-box that does not need 
to be audited; and the broker can use the same legal and 
technical mechanisms used by desktop software compa- 
nies today. Advertiser privacy is weakened because it is 
slightly easier to learn an ad’s targeting information as 
compared to today’s systems. Privad does not however 
change the ease with which an attacker can learn an ad- 
vertiser’s bids. 


7.4.1 Attacker at Client 


Attack A1: The attacker installs malware on a user’s
computer which provides the profile information to the 
attacker or otherwise exploits it. 


Defense D1: Privad does not protect against malware 
reading the profile it generates. Our general stance is that 
even without Privad, malware today can learn anything 
the client is able to learn, and so not protecting against 
this threat does not qualitatively change anything. Hav- 
ing said that, obviously the existence of the profile does 
make the job of malware easier. It saves the malware 
from having to write its own profiling mechanisms. It 
also allows the malware to learn the profile more quickly 
since it doesn’t have to monitor the user over time to 
build up the profile. 


Ultimately what goes into the profile is a policy ques- 
tion that privacy advocates and society need to answer. 
Clearly information like credit card number, passwords, 
and the like have no place in the profile (though malware 
can of course get at this information anyway). Whether 
a user has AIDS probably also does not belong there. 
Whether a user is interested in AIDS medication, how- 
ever, arguably may belong in the profile. 


Indeed, there are pros and cons to keeping profile con- 
tents open. On the pro side, this makes it easier for pri- 
vacy advocates to monitor the client and to an extent bro- 
ker operation. On the con side, it makes life easier for 
malware. One option, if the operating system supports it, 
is to make the profile available only to the client process 
(e.g. through SELinux [25]). This would
protect against userspace malware, but not rootkits that 
compromise the OS. Another option is to leverage trusted 
hardware (e.g. [31]) when available. How best to handle 
the profile from this perspective is both an ongoing re- 
search question and a policy question. 


7.4.2 Attacker at Dealer 


A2: The attacker attempts to learn user profile informa- 
tion by reading messages at the dealer. 

D2: The dealer proxies five kinds of messages: sub- 
scribe, publish, auction request and response, and re- 
ports. Of these, the dealer cannot inspect the contents 
of subscribe, report, and publish messages since the first 
two are encrypted with the broker’s public key, and the 
last is encrypted with a symmetric key that is exchanged 
via the encrypted subscribe message. Auction messages, 
which are unencrypted, contain a random single-use Iid
that identifies the ad at the broker and the client (ex- 
changed over the encrypted publish message), but is 
meaningless to the dealer. 

A3: The attacker injects messages at the dealer in order 
to learn a user’s profile information. 

D3: The dealer cannot inject a fake publish message 
since it would not validate at the client after decryption. 
If the dealer injects a fake subscribe message, all result- 
ing publish messages would be discarded by the client 
since the client would not have a record of the subscribe 
or the associated key. The dealer cannot inject fake auc- 
tion messages since the client would not have a record of 
the Iid. The dealer could reorder the auction result, but
would not learn which ad the client viewed or clicked 
since reports are encrypted. The dealer injecting fake re- 
ports has no impact on the client; it is, however, identical 
to dealer-assisted click-fraud, which we consider next. 

A4: The dealer itself engages in click-fraud, or otherwise
does not comply with the broker’s request to block
fraudulent clients. 

D4: The broker can independently audit that the dealer 
is operating as expected, both actively and passively. The
broker can passively track view/click volumes, and his- 
torical statistics on a per-dealer basis to identify anoma- 
lous dealers. Additionally the broker can passively mon- 
itor the rate of fraudulent clicks (e.g. using bait ads) 
on a per-dealer basis. The broker can detect suspicious
dealer behavior if, after it directs dealers to stem a particular
attack, the rate of fraudulent clicks through one
dealer does not drop (or drops proportionally less than it does
for other dealers). Finally, the broker can actively test a
dealer by launching a fake click-fraud attack from fake 
clients, and ensuring the dealer blocks them as directed. 
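
The passive check described in D4 could look roughly like the sketch below, which compares each dealer's proportional drop in fraudulent clicks against the typical drop. The function name, the use of the median as the baseline, and the `tolerance` factor are illustrative assumptions, not part of the Privad design.

```python
from statistics import median


def suspicious_dealers(before, after, tolerance=0.5):
    """Flag dealers whose fraud rate dropped much less than is typical.

    before/after map a dealer id to its fraudulent-click rate (e.g. measured
    with bait ads) before and after the broker asked dealers to block an
    attack. A dealer is flagged when its proportional drop is below
    tolerance * median drop across all dealers.
    """
    drops = {}
    for dealer, rate_before in before.items():
        if rate_before <= 0:
            continue  # no fraud observed through this dealer; nothing to compare
        rate_after = after.get(dealer, 0.0)
        drops[dealer] = (rate_before - rate_after) / rate_before
    if not drops:
        return []
    typical_drop = median(drops.values())
    return sorted(d for d, drop in drops.items() if drop < tolerance * typical_drop)


# Hypothetical rates: d1 and d2 cut fraud by 90%, d3 by only about 8%.
before = {"d1": 100.0, "d2": 80.0, "d3": 120.0}
after = {"d1": 10.0, "d2": 8.0, "d3": 110.0}
flagged = suspicious_dealers(before, after)  # ["d3"]
```

A flagged dealer is only a candidate for closer inspection; the active test with fake clients described above would be the natural follow-up.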

A5: A particularly sneaky attack aimed at learning which
users send view or click reports for a given publisher (or 
advertiser) is as follows. The dealer first launches a click- 
fraud attack on the given publisher (or advertiser). The 
broker identifies the attack. When a user sends a legiti- 
mate report for that publisher (or advertiser), the broker 
mistakenly suspects the report as fraudulent and asks the 
dealer to block the client. The dealer can now infer that 
the encrypted report it proxied must have matched the 
attack signature it helped create. 

D5: First note that this attack applies only in the scenario
where there are no other click-fraud attacks taking
place other than the one controlled by the dealer (and the 
dealer somehow knows this). As part of the Privad pro- 
tocol (Figure 5), however, the dealer does not learn how 
many attacks are taking place (even if there is only one 
ongoing attack), or which publishers or advertisers are 
under attack, or which attack the client was implicated 
in. Thus there is too much noise for the dealer to reach 
any conclusions about implicated clients. 


7.4.3 Attacker at Broker 


A6: The broker attempts to link multiple messages from 
the same user using passive or active approaches. 

D6: We are only concerned with subscribe and report
messages since the dealer mixes auction requests. Privad
messages do not contain any PII, unique identifier,
or sequence number. The monitor ensures the
per-subscription symmetric keys are unique and random.
Additionally, the monitor disrupts timing based correla- 
tion, for instance by staggering bursts of messages (e.g. 
when the client starts up, or views a website with many 
adboxes). Altogether these defenses prevent the broker 
from linking two subscriptions, or two reports from the 
same user. 

The broker may attempt to link a report with a subscription.
The only way to do this is by publishing an ad
with a unique ad ID, and waiting for a report with that ID.
Privacy advocates can detect this by running honeyfarms
of identical clients and ensuring ad IDs are repeated.

A7: During the GSP auction mechanism the broker
attempts to link two ads published to the same client
through different pub-sub subscriptions, thereby effectively
linking two subscriptions.

D7: The property of the mix constructed at the dealer is
such that tuples from the same client but for ads on dif- 
ferent pub-sub channels are indistinguishable from tuples 
from two different clients each subscribed to one of the 
channels. The pub-sub protocol provides the same prop- 
erty. Thus the broker doesn’t learn anything new from 
the auction protocol. 

Note the broker can obviously link which ads it sent
for the same subscription, but cannot determine which of
them actually matched the user. This is because the client
submits all ads received on a channel for auction whether
or not they matched the user (enforced by the monitor); bogus
user scores for non-matching ads prevent the broker
from distinguishing between the two.

A8: The broker masquerades as a dealer and hijacks the 
client’s messages thus learning the client’s IP address. 
Possible methods of hijacking the traffic may include 
subverting DNS or BGP. 

D8: The solution is to require Transport Layer Security 
(TLS) between client and dealer, and to use a trusted certificate
authority. The reference monitor can ensure that
this is done correctly. 

A9: The broker creates a channel with a large enough 
number of attributes that an individual user is uniquely 
defined. When that user joins the channel, the broker 
knows that a user with those attributes exists. This could 
be done for instance to discover the whereabouts of a 
known person or to discover additional attributes of a 
known person. For instance, if n attributes are known to 
uniquely define the person, then any additional attributes 
associated with a joined channel can be discovered. 

D9: It is precisely for this reason that pub-sub channel
definitions are static, well-known, and public (Section 3.1).
Privacy advocates can look at channel definitions
and ensure they meet a minimum expected
anonymity set size. Additionally, the monitor can filter 
out channel definitions when the attributes for that chan- 
nel exceed some set threshold. 

Similar restrictions apply to the set of profile attributes 
an ad can target, with one difference. In the context 
of second-price auctions, the broker necessarily needs
to link adjacent ads. Thus the monitor needs to enforce
that the sum of attributes of the two ads involved in a 
click-report is below the threshold. 
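
The two monitor checks in D9 can be sketched as follows. The function names and the representation of attributes as sets of strings are hypothetical; the comparison operators follow the wording above (a channel is filtered when its attribute count exceeds the threshold, and a click report is allowed only when the summed attribute count of the two linked ads is below it).

```python
def channel_allowed(channel_attributes, threshold):
    """Accept a channel definition unless its attribute count exceeds the threshold."""
    return len(set(channel_attributes)) <= threshold


def click_report_allowed(clicked_ad_attrs, second_price_ad_attrs, threshold):
    """Accept a click report only if the two linked ads together stay below the threshold.

    The counts are simply summed, as in the text; attributes shared by both
    ads are counted twice, which is the more conservative reading.
    """
    return len(clicked_ad_attrs) + len(second_price_ad_attrs) < threshold


# Hypothetical threshold established offline by privacy advocates.
THRESHOLD = 4
ok_channel = channel_allowed({"music", "sports.skiing"}, THRESHOLD)
ok_click = click_report_allowed({"music", "age.25-34"}, {"sports.skiing"}, THRESHOLD)
too_specific = click_report_allowed({"music", "age.25-34"}, {"sports.skiing", "city.x"}, THRESHOLD)
```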

Note the ability to link two ads applies only to clicks. 
View reports do not contain second price information 
since otherwise a page with many ads would allow the 
broker to link each consecutive pair of ads, and therefore 
a whole chain of ads. While the same problem exists if 
the user were to click on the whole chain of ads, since 
clicks are rare this is not a big concern. 


8 Related Work


There is surprisingly little past work on the design of private
advertising systems, and what work there is tends to
focus on isolated problems rather than a complete system 
like Privad. This related work section focuses only on 
systems that target private advertising per se, and mainly 
concentrates on the privacy aspects of those systems. 
In particular, we look at Juels [20], Adnostic [30], and 
Nurikabe [24]. 

Juels by far predates the other work cited here, and indeed
is contemporary with the first examples of the modern
advertising model (i.e. keyword-based bidding). As
such, Juels focuses on the private distribution of ads and 
does not consider other aspects such as view-and-click 
reporting or auctions. Privad’s dissemination model is 
similar to Juels’ in that a client requests relevant ads 
which are then delivered. Indeed, Juels’ trust model is 
stronger than Privad’s. Juels proposes a full mixnet be- 
tween client and broker, thus effectively overcoming col- 
lusion. We believe this trust model is overkill, and that 
his system pays for this both in terms of efficiency and in 
the mixnet’s inability to aid the broker in click fraud. 

Like Juels and Privad, Adnostic also proposes client- 
side software that profiles and protects user privacy. 
When a user visits a webpage containing an adbox, the 
URL of the webpage is sent to the broker as is done to- 
day. The broker selects a group of ads that fit well with 
the ad page (they recommend 30), and sends all of them 
to the client. The client then selects the most appropriate 
ad to show the user. The novel aspect of Adnostic is how 
to report which ad was viewed without revealing this to 
the broker. Adnostic uses homomorphic encryption and 
efficient zero-knowledge proofs to allow the broker to 
reliably add up the number of views for each ad without 
knowing the results (which remain encrypted). Instead, 
they send the results to a trusted third-party which de- 
crypts them and returns the totals. By contrast to views, 
Adnostic treats clicks the same as current ad networks: 
the client reports clicks directly to the broker. 

The privacy model proposed by Adnostic is much 
weaker than that of Privad. Privad considers users’ web 
browsing behavior and click behavior to be private; Adnostic
does not. Indeed, we would argue that the knowledge
that Adnostic provides to the broker allows it to
very effectively profile the user. A user’s web browsing
behavior says a lot about the user’s interests and many demographics.
Knowledge of which ads a user has clicked
on, and the demographics to which those ads were targeted,
allows the broker to profile the user even more effectively.
Finally, the user’s IP address provides location demo- 
graphics and effectively allows the broker to identify the 
user. Adnostic’s trust model for the broker is basically 
honest-and-not-curious. If that is the case, then today’s 
advertising model should be just fine. 


Nurikabe also proposes client-side software that pro- 
files the user and keeps the profile secret. With Nurik- 
abe, the full set of ads are downloaded into the client. 
The client shows ads as appropriate. Before clicking any 
ads, the client requests a small number of click tokens 
from the broker. These tokens contain a blind signature, 
thus allowing the tokens to later be validated at the bro- 
ker without the broker knowing who it previously gave 
the token to. When the user clicks on an ad, the click report
is sent to the advertiser along with the token. The advertiser
sends the token to the broker, who validates it, and
this validation is returned to the client via the advertiser. 
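
To illustrate why a blind signature lets a signer validate a token without recognizing it, the sketch below uses textbook RSA blinding with toy parameters. It is a generic illustration rather than Nurikabe's actual scheme: the key values, the token value, and the function names are assumptions, the key is far too small for real use, and a real deployment would hash and pad the token.

```python
# Toy RSA key (p = 61, q = 53); illustrative only, never use keys this small.
N = 61 * 53   # modulus, 3233
E = 17        # public exponent
D = 2753      # private exponent, since 17 * 2753 = 1 (mod 3120)


def blind(token, r):
    """Client: hide the token with a random factor r that is coprime to N."""
    return (token * pow(r, E, N)) % N


def sign_blinded(blinded):
    """Signer: sign without seeing the underlying token."""
    return pow(blinded, D, N)


def unblind(blinded_sig, r):
    """Client: strip the blinding factor to obtain a signature on the token."""
    return (blinded_sig * pow(r, -1, N)) % N  # modular inverse, Python 3.8+


def verify(token, sig):
    """Anyone with the public key: check the signature on the token."""
    return pow(sig, E, N) == token % N


token = 1234   # hypothetical click-token value, must be smaller than N
r = 7          # blinding factor chosen by the client
blinded = blind(token, r)
signature = unblind(sign_blinded(blinded), r)
valid = verify(token, signature)
```

The signer only ever sees `blinded`, which differs from `token`, yet the unblinded signature verifies against the original token, so the signer cannot tell to whom it issued a token it later validates.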

Nurikabe has an interesting privacy model. They ar- 
gue that, since the advertiser anyway is going to see the 
click, there is no loss of privacy by having the advertiser 
proxy the click token. By taking this position, Nurik- 
abe avoids the need for a separate dealer. Our problem 
with this approach is that Nurikabe basically gives up on 
the problem of privacy from the advertiser altogether. It 
cannot report views without exposing this to the adver- 
tiser, thus reducing user privacy from the advertiser even 
more than today. View reporting is important, in part be- 
cause it allows the advertiser to compute the CTR and 
know how well its ad campaign is going. Nurikabe also 
gives up any visibility into click fraud. Nurikabe miti- 
gates click fraud only by rate limiting the tokens it gives 
to every user. As a result, the attacker need only Sybil 
itself behind a botnet and solve CAPTCHAs to launch a 
massive click-fraud attack which cannot be defended. Fi- 
nally, in [13] the authors find through ad measurements 
that there are simply far too many ads (with too much 
churn) to be able to distribute them all to all clients. 

Some aspects of Privad have previously been explored 
in [13,15]. The seed idea behind Privad was planted 
in [15], a short paper revisiting the economic case for advertising
agents on the endhost (i.e., distinguishing “adware”
from “badware”), which presents a rough sketch
of privacy-aware click reporting. In [13] we use mea- 
surement data to guide our design and explore the feasi- 
bility of building such a system. This paper presents the 
resulting detailed design, experimental evaluation, and 
security analysis of a full advertising system. 


9 Summary and Future Directions 


This paper describes a practical private advertising sys- 
tem, Privad, which attempts to provide substantially bet- 
ter privacy while still fitting into today’s advertising busi- 
ness model. We have designs and detailed privacy analy- 
sis for all major components: ad delivery and reporting, 
click fraud defense, advertiser auctions, user profiling, 
and optimizations for scalability. 

We are actively working on getting a better under- 
standing of a number of Privad components. Foremost 
among these are how best to do profiling, how best to run 
auctions, the bait approach to click-fraud, and privacy 
from the advertiser. Another important problem is how 
to allow brokers and advertisers to gather rich statistical 
information about user behavior in a privacy-preserving 
way. Towards this end, we are looking at distributed
forms of differential privacy. We are also working with 
application developers to deploy at Internet scale to give 
researchers a platform for experimenting with real users 
and advertisements. 

Besides pursuing the technical aspects of Privad, we
have discussed Privad with a number of privacy advo- 
cates and policy makers, and have applied for a Euro- 
prise privacy seal. We hope that Privad and other recently 
proposed private advertising systems spur a rich debate 
among researchers and privacy advocates as to the best 
ways to do private advertising, the pros and cons of the 
various systems, and how best to move private advertis- 
ing forward in society. 

Abstract


Online marketplaces are now a popular way for users to 
buy and sell goods over the Internet. On these sites, user 
reputations—based on feedback from other users con- 
cerning prior transactions—are used to assess the likely 
trustworthiness of users. However, because accounts 
are often free to obtain, user reputations are subject to 
manipulation through white-washing, Sybil attacks, and 
user collusion. This manipulation leads to wasted time 
and significant monetary losses for defrauded users, and 
ultimately undermines the usefulness of the online mar- 
ketplace. 

In this paper, we propose Bazaar, a system that ad- 
dresses the limitations of existing online marketplace 
reputation systems. Bazaar calculates user reputations 
using a max-flow-based technique over the network 
formed from prior successful transactions, thereby limit- 
ing reputation manipulation. Unlike existing approaches, 
Bazaar provides strict bounds on the amount of fraud that 
malicious users can conduct, regardless of the number 
of identities they create. An evaluation based on a trace 
taken from a real-world online marketplace demonstrates 
that Bazaar is able to bound the amount of fraud in prac- 
tice, while only rarely impacting non-malicious users. 


1 Introduction 


Online marketplaces like eBay, Overstock Auctions, and 
Amazon Marketplace enable buyers and sellers to con- 
nect regardless of each other’s location, allowing even 
the most esoteric of products to find a market. These 
marketplaces have greatly expanded the set of people 
who can act as a buyer or seller and, thus, can be viewed 
as democratizing commerce. These sites are extremely 
popular with users; in 2009, over $60 billion worth of 
goods was exchanged on eBay alone. 

This new freedom, however, does not come without 
challenges. Online marketplaces are known to suffer 
from fraud, and often rely on user reputations (formed
from the feedback provided by other users) in an effort
to mitigate the effects of malicious activities on their
sites. For example, on eBay, potential buyers often ex- 
amine the reputation of the seller to determine the seller’s 
trustworthiness. In fact, it has been observed [13, 15, 19] 
that sellers with highly positive reputations tend to sell 
goods at a higher price when compared to sellers with 
lower reputations, demonstrating the central role that 
user reputations play in online marketplaces. Malicious 
buyers (who do not pay for goods purchased) and ma- 
licious sellers (who do not deliver the promised goods) 
quickly gain bad reputations and are avoided [11]. 

One challenge, however, is that accounts on online 
marketplaces are often free to create (usually only requir- 
ing filling out a form and solving a CAPTCHA [23]), to 
avoid discouraging potential users. As a result, reputa- 
tions derived from user feedback are still subject to three 
types of manipulation: 


• Malicious users whose accounts have a bad reputation
can effectively white-wash their reputation by
creating a new account with a blank reputation.


• Malicious users can collude by providing positive
feedback on each other’s transactions, thereby improving
both of their reputations.¹


• Malicious users can create fake identities, known as
Sybils [7], and use these to provide positive feedback
on fictitious transactions between the various
identities, thereby inflating their reputations.


Reputation manipulation can lead to significant mone- 
tary losses for defrauded users. For example, a single 
malicious eBay user was recently found to have created 
260 different accounts, fabricated positive feedback, and 
stolen over $717,000 from over 5,000 users [24]. This
case is hardly unique: another malicious eBay user was
arrested after defrauding others of over $1 million [20].

¹In fact, this type of abuse can be plainly viewed on eBay by searching
for auctions that are selling “positive feedback.” As of this writing,
350 such auctions exist for prices ranging from $0.01 to $0.99.

In this paper, we propose Bazaar, a system that 
strengthens user reputations in online marketplaces in 
the face of collusion, white-washing, and Sybil attacks. 
Bazaar creates and maintains a risk network in order to 
predict whether potential transactions are likely to be 
fraudulent. The risk network consists of weighted links 
between pairs of users who have successfully conducted 
transactions in the past. When a transaction is about 
to be completed, Bazaar calculates the max-flow be- 
tween the buyer and seller; if it is lower than the amount 
of the transaction, the transaction is flagged as poten- 
tially fraudulent. Since Bazaar only needs to determine 
whether the max-flow is above a given value (instead of 
calculating the exact max-flow), Bazaar stores the risk 
network using a novel multi-graph representation. We 
demonstrate that this results in a substantial speed-up 
of Bazaar’s max-flow calculation while imposing only a 
modest storage overhead. 
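
The threshold check described above can be sketched with a standard augmenting-path max-flow that stops as soon as the flow reaches the transaction amount. This is only a sketch under stated assumptions: it treats risk-network links as undirected, uses plain Edmonds-Karp on an adjacency map rather than the multi-graph representation the paper introduces, and the names `build_risk_network` and `flow_at_least` are hypothetical.

```python
from collections import defaultdict, deque


def build_risk_network(links):
    """Build capacities from (user_a, user_b, weight) links, assumed undirected."""
    cap = defaultdict(lambda: defaultdict(int))
    for a, b, weight in links:
        cap[a][b] += weight
        cap[b][a] += weight
    return cap


def flow_at_least(cap, source, sink, amount):
    """Return True if the max-flow from source to sink is at least amount.

    Runs BFS augmenting paths (Edmonds-Karp) on a private copy of the
    capacities and stops early once the accumulated flow reaches amount.
    """
    if source == sink:
        return True
    residual = {u: dict(nbrs) for u, nbrs in cap.items()}
    flow = 0
    while flow < amount:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return False  # no more augmenting paths; max-flow is below amount
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual.setdefault(v, {})
            residual[v][u] = residual[v].get(u, 0) + bottleneck
            v = u
        flow += bottleneck
    return True


# Hypothetical risk network; the max-flow between alice and carol is 9.
network = build_risk_network([
    ("alice", "bob", 10), ("bob", "carol", 5),
    ("alice", "dave", 4), ("dave", "carol", 4),
])
flagged_small = not flow_at_least(network, "alice", "carol", 9)   # False: allowed
flagged_large = not flow_at_least(network, "alice", "carol", 10)  # True: flagged
```

Stopping once the flow reaches the transaction amount reflects the observation that only a comparison against a given value is needed, not the exact max-flow.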

Bazaar provides a number of useful security proper- 
ties: First, malicious users in Bazaar cannot conduct 
more fraud together than they could separately, and as 
a result, there is no incentive for malicious users to col- 
lude. Second, malicious users cannot gain any advantage 
from conducting Sybil attacks, and thus, there is no in- 
centive to create multiple identities. Third, Bazaar ex- 
plicitly allows users to create as many identities as they 
wish; this is sometimes a desired feature in online mar- 
ketplaces, where sellers may own multiple businesses or 
wish to maintain separate identities for different types of 
goods. Fourth, Bazaar provides a strict guarantee that 
each user can only defraud others by up to the amount of 
valid transactions the user has participated in, regardless 
of the number of identities the user possesses, thereby 
bounding the potential damage. 

We evaluate Bazaar using a trace collected from eBay, 
the largest online marketplace. We collected a 90-day 
history of five of the most popular categories on the eBay 
United Kingdom site, encompassing over 3 million users 
and 8 million auctions. Simulating Bazaar on this data 
set, we demonstrate that Bazaar successfully bounds the 
amount of fraudulent transactions that malicious users 
can conduct, while only rarely impacting the transactions 
that occur between non-malicious users. We demonstrate 
that if Bazaar had been deployed on eBay during the 90- 
day period and in the five categories we study, it would 
have flagged over £164,000 of auctions that eventually 
resulted in negative feedback as potentially fraudulent, 
substantially increasing the reliability of the online mar- 
ketplace. 

The rest of this paper is organized as follows. Sec- 
tion 2 describes the approaches that are currently taken to 
secure online marketplaces, and Section 3 provides more 
detail on different types of fraud that are still present today.
Section 4 describes the design of Bazaar in detail,
and Section 5 details the multi-graph representation of 
the risk network. Section 6 presents an evaluation of 
Bazaar. Section 7 details related work and Section 8 con- 
cludes. 


2 Background 


Online marketplaces often use site-specific mechanisms 
for fraud prevention, but many of these can be reduced to 
a few simple techniques: 


Making joining the market difficult. Certain marketplaces
only allow trusted users or organizations to participate
as sellers, often requiring upfront fees or accounts
backed by difficult-to-forge financial information.
An example of such an approach is Amazon Merchants [3],
which requires bank account information, a
$40-per-month fee, and pre-approval for listing high- 
fraud-risk goods. However, by making it more difficult 
to join, this approach reduces the usefulness of the mar- 
ketplace and severely restricts the population of sellers. 


Using a trusted broker. In some marketplaces, a middleman
participates in the transaction and holds payment
until the buyer is satisfied with the transaction. For example,
on eBay, there are escrow services that hold money
for transactions until the buyer has received the good.
However, brokers typically charge a fixed fee and a percentage
of the sale,² increasing the transaction cost and
making escrow practical only for expensive goods (rep- 
resenting a small minority of the goods on typical mar- 
ketplaces). 


Requiring in-person transactions. Other marketplaces
such as Craigslist require buyers and sellers to be within 
the same geographical area, ensuring that the participants 
can meet in person to complete a transaction. This ap- 
proach allows buyers to inspect goods, and sellers to ver- 
ify payment, before going through with the transaction. 
However, this approach also severely restricts who is able 
to buy and sell goods from each other (as the buyer and 
seller must live close to each other), limiting its useful- 
ness to local marketplaces. 


Providing insurance. Certain marketplaces offer buyer
and seller insurance programs, either by default or for a 
fee. However, coverage is generally limited to certain 
geographic regions and the cost of the insurance pay- 
outs and program administration results in higher fees 
for marketplace users. Nevertheless, the information that 
Bazaar provides can be viewed as an estimate of risk between
two parties, and can therefore be used as an input
when choosing the appropriate insurance premium.

²For example, eBay’s recommended escrow service charges a minimum
of $22 and up to 3% of the transaction cost.


Paying via trusted services. Because certain payment
methods (e.g., money orders) are difficult to recover, 
many marketplaces suggest or require that trusted on- 
line payment services (e.g., PayPal) be used. Ideally, 
such services would link accounts to real-world financial 
information, making the creation of multiple accounts 
difficult. However, this is not the case: For example, 
receiving money with a PayPal account only requires 
an email address (although financial information is re- 
quired to withdraw funds). Thus, malicious users can 
receive money with networks of email-backed accounts, 
and then send that money to the single, “real” account 
that is able to withdraw money. 


Leveraging feedback. Finally, many online marketplaces
use feedback provided by users who have participated
in transactions. For example, eBay’s feedback
mechanism calculates a score for each user, consisting
of the amount of positive feedback minus the amount of
negative feedback. Users with highly positive feedback
scores are considered to be more trustworthy, and have
been observed to sell goods for higher prices [13, 15, 19].
This approach has the advantage of not restricting mar- 
ketplace membership and allowing any buyer and seller 
to participate in a transaction. However, as we will ob- 
serve in the next section, using feedback is often subject 
to manipulation by malicious users. 


Ideally, we would like to prevent fraud without un- 
necessarily restricting participation in the online market- 
place. The first four approaches above artificially restrict 
the marketplace by making it harder to join, making it more
expensive to use, segmenting it based on geography, or
spreading the cost of fraud to all users. Thus, we focus
on the last approach, leveraging feedback, for the design 
of Bazaar and present a design that is not subject to the 
manipulation of existing approaches. Focusing on user 
feedback also has the advantage that it is the mechanism
used by the largest online marketplaces, such as eBay, 
meaning Bazaar could be directly applied to such sites. 


3 Examples of malicious behavior 


We motivate the design of Bazaar by examining several 
types of fraud that have been observed in online marketplaces
today. The eBay dataset that we use for illustration
is fully described in Section 6; however, our purpose
here is simply to provide a few motivating examples. In 
this section, we focus on malicious sellers who attempt 
to defraud buyers, as sellers are largely protected from 
malicious buyers by being allowed to verify payment before
shipping the good. To define the fraud we observe,
we look at various sellers’ feedback history, consisting
of entries recording whether the buyer was satisfied with 
the transaction. 

For clarity, we begin by examining the feedback his- 
tory of a typical seller, shown in Figure 1 (a). Even 
though over 99% of the seller’s feedback is positive, a 
few items of negative feedback can be observed. A cer- 
tain low level of negative feedback is expected even for 
non-malicious sellers, as some buyers may have been un- 
satisfied with their purchase (e.g., due to the good being 
lost or damaged in transit, a miscommunication between 
the participants, or buyer’s remorse). We will use similar 
timeline diagrams throughout the rest of this section. 


3.1 Leaving the marketplace 


One of the most common types of fraud occurs when a 
seller participates in the marketplace as a non-malicious 
user for a period of time, and then turns malicious (often 
by starting to conduct transactions without ever shipping 
the goods). As a result, the unsuspecting buyers who 
have not yet received their goods are defrauded. This 
type of fraud can be detected once the buyers begin to 
provide negative feedback, serving as a warning to oth- 
ers. However, malicious users often take advantage of 
the “window of opportunity” before the negative feed- 
back appears: They can advertise and accept payment 
for a large number of goods before any user realizes that 
a fraud has occurred. 

An example of such a malicious seller is shown in Figure 1 (b).
Towards the end of the seller’s timeline, he lists
a significant number of goods that are never delivered 
and eventually result in negative feedback. In fact, this 
user made significantly more money in aggregate from 
the fraudulent transactions than from the non-fraudulent 
transactions. The underlying problem is that in-progress 
transactions are not counted against a seller’s reputa- 
tion, enabling malicious users to establish a reputation, 
defraud users with the window of opportunity, and then 
re-join the site with a new account. 


3.2 Hiding fraud in the noise 


As an alternative to leaving the marketplace, malicious 
users have also been observed to “hide the fraud in the 
noise” by participating in many non-fraudulent trans- 
actions, but conducting fraudulent transactions for (rel- 
atively) expensive goods. As a result, their feedback 
history has only a small amount of negative feedback, 
and only a close inspection of the transaction values reveals
the fraud. An example of a malicious user conducting
such fraud is shown in Figure 1 (c), where
the user made more money through the two fraudulent 
transactions than through the hundreds of non-fraudulent
transactions. The underlying problem is that the value
of transactions is not considered when determining a
seller’s reputation, enabling malicious users to conduct
a high-value fraudulent transaction with the same effective
penalty (one piece of negative feedback) as a low-value
fraudulent transaction.

Figure 1: Auction feedback history over time for three eBay sellers: (a) a typical seller, (b) a malicious seller who
leaves the marketplace, and (c) a malicious seller who hides the fraud in the noise by conducting a few large fraudulent
transactions. Positive feedback is shown in green, neutral feedback in blue, and negative feedback in red and below
the line. The size of each bar corresponds to the log of the value of the auction.
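
A minimal sketch of why a count-based score misses this pattern. The feedback records below are invented for illustration, and `feedback_score` and `value_by_rating` are hypothetical helpers; the first follows the positive-minus-negative rule described in Section 2, the second simply sums transaction values per rating.

```python
def feedback_score(history):
    """Count-based score: number of positives minus number of negatives."""
    positives = sum(1 for rating, _ in history if rating == "positive")
    negatives = sum(1 for rating, _ in history if rating == "negative")
    return positives - negatives


def value_by_rating(history):
    """Total transaction value grouped by feedback rating."""
    totals = {}
    for rating, value in history:
        totals[rating] = totals.get(rating, 0) + value
    return totals


# Hypothetical seller: 200 cheap honest sales, then 2 expensive frauds.
history = [("positive", 3)] * 200 + [("negative", 900)] * 2

print(feedback_score(history))   # 198
print(value_by_rating(history))  # {'positive': 600, 'negative': 1800}
```

The score of 198 looks excellent, yet in this made-up history three quarters of the money that changed hands was lost to fraud.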


3.3 Conducting fictitious transactions


Malicious users have also been observed to conduct fic- 
titious transactions and provide fictitious positive feed- 
back. The ultimate goal of these transactions is not to 
sell a good, but rather, to improve the user’s feedback 
score, making the user look more like a non-malicious
user. For example, numerous auctions on eBay are labeled
with “Positive Feedback Guaranteed.” Often, these
auctions ostensibly offer a copy of a digital picture or 
other token item, so as to appear as a legitimate auction. 

Thus, it is easy for a malicious user to arbitrarily ma- 
nipulate his feedback score by adding spurious positive 
feedback, so as to appear as a legitimate seller. The un- 
derlying problem is that feedback counts the same, re- 
gardless of the other user providing the feedback. This 
allows malicious users to conspire to inflate each other’s 
feedback score (or, a single malicious user to do the same 
via a Sybil attack). 


3.4 Summary 


In this section, we described three of the most common 
types of reputation manipulation that are present in the 
online marketplaces of today. In the next section, we de- 
scribe the design of Bazaar, which addresses each type 
of manipulation by (a) considering outstanding transac- 
tions, (b) taking into account the value of transactions 
with positive and negative feedback, and (c) discriminat- 
ing between different users’ feedback, in order to pre- 
vent malicious users from artificially inflating their repu- 
tation. 


4 Bazaar design


We now describe the design of Bazaar. 


4.1 Overview 


Bazaar is intended to augment an online marketplace, run 
by a marketplace operator, where buyers and sellers may 
have no previous relationship and accounts are free to ob- 
tain. In such systems, buyers must rely on the reputation 
of the sellers, represented by feedback from other buy- 
ers, to distinguish between non-malicious and malicious 
users. Thus, the goal of Bazaar is to protect buyers from 
malicious sellers who manipulate their reputation so as 
to appear non-malicious. Additionally, we aim to keep 
the existing model and basic user operations, while sig- 
nificantly reducing the vulnerability to fraud. By doing 
so, Bazaar serves as a drop-in component applicable to 
numerous marketplaces. 

Now, let us introduce a few definitions that we use for 
the remainder of this section. A user corresponds to an 
actual person in the offline world. An identity is an online 
account with a particular username associated with it. A 
user can have a potentially arbitrary number of identities. 
A transaction is an event where two identities agree to a 
sale, which has some value. Note that both identities in a 
transaction may correspond to the same user. 

Bazaar relies on two insights. First, successful trans- 
actions between different users require significant effort 
and risk for both parties. Both users are trusting the other 
to complete the transaction, by providing payment or de- 
livering the good. We refer to this as shared risk be- 
tween two users. Second, once a transaction has been 
successfully completed, the two users are more likely to 
enter into a transaction together in the future. Note, however,
that this risk is not unbounded, and is dependent on the
type of transaction that has occurred: The amount of risk 
that two users are willing to undertake is likely propor- 
tional to the amount of risk that has been successfully 
rewarded. 


4.2 Risk network


We view a successful transaction as linking two identi- 
ties in an undirected fashion, where the weight of the 
link is the aggregate monetary value of all success- 
ful transactions—successfully rewarded shared risk— 
between the two identities. For example, if identities A 
and B participated in two successful transactions for $5 
and $10, there would be an A ↔ B link with weight $15.
Note that link weights must always be non-negative. 

The set of all such links forms an undirected network, 
which we refer to as the risk network. An example of 
such a network is shown in Figure 2 (a). Note that 
the risk network has a particularly useful property: The 
weights are automatically generated by user actions, and 
do not have to be explicitly provided by users. As we 
demonstrate below, the risk network can be used not only 
to gauge the risk between two identities who have con- 
ducted a transaction in the past, but also between arbi- 
trary identities who may not have directly interacted in 
the past. 
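
As a minimal sketch of this construction, the snippet below aggregates successful transactions into undirected link weights. The `(identity, identity, value)` tuple layout and the function name `build_risk_network` are assumptions made for illustration, not part of Bazaar's described interface.

```python
def build_risk_network(transactions):
    """Aggregate successful transactions into undirected link weights.

    `transactions` is an iterable of (identity_a, identity_b, value) tuples.
    The result maps a sorted (u, v) pair to the total value exchanged.
    """
    weights = {}
    for a, b, value in transactions:
        if value < 0:
            raise ValueError("transaction values must be non-negative")
        # Sorting makes the key order-independent: A-B is the same link as B-A.
        link = tuple(sorted((a, b)))
        weights[link] = weights.get(link, 0) + value
    return weights


# The $5 and $10 transactions between A and B from the text, plus one
# hypothetical $7 transaction between A and C.
network = build_risk_network([("A", "B", 5), ("B", "A", 10), ("A", "C", 7)])
print(network)  # {('A', 'B'): 15, ('A', 'C'): 7}
```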


4.3 Design 


Bazaar is run behind the scenes by the online marketplace
operator. The basic operation of Bazaar is simple:
When a buyer is about to enter into a transaction,
the marketplace operator queries Bazaar, which calcu- 
lates the max-flow in the risk network between the buyer 
and the seller. If the max-flow is below the amount of the 
potential transaction, the marketplace operator flags the 
transaction as potentially fraudulent. We discuss ways in 
which this output can be used by the marketplace oper- 
ator in Section 4.5, but for now, we assume that flagged 
transactions are blocked. 

The intuition for this approach lies in the observation 
above about shared risk. Consider a risk network with 
only two identities, connected by a link of weight w. The 
identities may be willing to engage in another transaction 
of value w, and if that is successful, then another trans- 
action for a higher amount. Bazaar generalizes this intu- 
ition, allowing identities who are not directly connected 
to engage in a transaction as long as there is a set of paths 
of sufficient weight connecting them. For example, in the 
network shown in Figure 2 (a), if A was about to buy a 
good from D, Bazaar would consider the flow on paths 
A ↔ B ↔ D and A ↔ C ↔ D in order to determine
D’s reputation from A’s perspective.
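
The query described above can be sketched as follows. This is an illustration under stated assumptions rather than Bazaar's implementation: it uses the textbook Edmonds-Karp algorithm, models each undirected link as two directed arcs of equal capacity, and uses made-up link weights, since the weights in Figure 2 (a) are not given in the text. The names `max_flow` and `is_flagged` are hypothetical.

```python
from collections import deque


def max_flow(links, source, sink):
    """Edmonds-Karp max-flow on an undirected weighted graph.

    `links` maps (u, v) pairs to non-negative weights.
    """
    cap = {}
    for (u, v), w in links.items():
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + w
        cap[v][u] = cap[v].get(u, 0) + w
    # Unknown identities (or a self-query) have no usable flow.
    if source == sink or source not in cap or sink not in cap:
        return 0
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Walk back from the sink to collect the path's arcs.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        total += push


def is_flagged(links, buyer, seller, amount):
    """Flag the transaction if the max-flow is below its value."""
    return max_flow(links, buyer, seller) < amount


# Hypothetical weights for a four-identity network shaped like the example.
links = {("A", "B"): 5, ("B", "D"): 4, ("A", "C"): 10, ("C", "D"): 9}
print(max_flow(links, "A", "D"))        # 13
print(is_flagged(links, "A", "D", 10))  # False
print(is_flagged(links, "A", "D", 20))  # True
```

With these weights the A ↔ B ↔ D path carries at most $4 and the A ↔ C ↔ D path at most $9, so a $10 purchase passes while a $20 purchase is flagged.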

In existing online marketplaces, feedback-based rep- 
utations are “global,” in the sense that everyone has the 
same view of a given user’s reputation. In Bazaar, repu- 
tations are a function of both the user who is being asked 
about as well as the user who is asking. As we demonstrate
below, this approach allows Bazaar to mitigate reputation
manipulation: Malicious users who conspire to
inflate their reputations do not necessarily increase their 
reputations from the perspective of non-malicious users. 


4.3.1 Putting credit “on hold”


The design of Bazaar is complicated by the fact that the 
buyer may not be able to determine whether the transac- 
tion was fraudulent immediately after sending payment 
for the good; generally, there is a delay between when 
he agrees to the transaction and when the good arrives. 
In order to prevent malicious sellers from abusing these 
outstanding transactions in the manner observed in Sec- 
tion 3.1, when the buyer decides to go through with the 
transaction, Bazaar first determines a path set³ between
the buyer and seller that has a total weight of at least the 
transaction amount. Such a path set must exist, as, other- 
wise, the max-flow between the buyer and seller is lower 
than the transaction amount (meaning Bazaar would have 
flagged the transaction as potentially fraudulent). 

Once the path set is determined, Bazaar temporarily 
lowers the weights on these paths (in aggregate) by the 
transaction amount. In essence, this puts the weight on 
these paths “on hold” until feedback concerning the suc- 
cess or failure of the transaction is received. Since each 
link weight must always be non-negative, this approach 
prevents the malicious users from leveraging the weight 
that is “on hold” in order to conduct additional transac- 
tions. 

Continuing with our running example in Figure 2, the 
initial state of the risk network is shown in Figure 2 
(a), with each identity having participated in transactions 
with two other identities. Then, suppose that A con- 
ducts a $10 transaction with D. Bazaar determines that 
the max-flow between A and D is greater than $10, and 
therefore allows the transaction to go through without be- 
ing flagged. In doing so, Bazaar temporarily lowers the 
link weights along the path set by a total of $10 (specifically, $2
is deducted from the A ↔ B ↔ D path and $8 is deducted
from the A ↔ C ↔ D path). This is shown in
Figure 2 (b). 
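
The hold step can be sketched as a max-flow computation that stops once the transaction amount has been routed. This is an assumption-laden illustration: the function name `place_hold`, the sorted-tuple link keys, the breadth-first path choice, and the example weights are all hypothetical, and the split between paths need not match the $2/$8 split in Figure 2, since any path set of sufficient weight is acceptable.

```python
from collections import deque


def place_hold(weights, buyer, seller, amount):
    """Put `amount` of credit on hold between buyer and seller.

    `weights` maps a (u, v) pair to the link weight. Returns a dict of
    per-link holds and lowers `weights` in place, or returns None and
    leaves `weights` untouched if the max-flow is below `amount`.
    """
    cap = {}
    for (u, v), w in weights.items():
        cap.setdefault(u, {})[v] = w
        cap.setdefault(v, {})[u] = w
    remaining = amount
    while remaining > 0:
        parent = {buyer: None}
        queue = deque([buyer])
        while queue and seller not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if seller not in parent:
            return None  # insufficient max-flow: the transaction is flagged
        path = []
        v = seller
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        # Never route more than what is still needed for this transaction.
        push = min(remaining, min(cap[u][v] for u, v in path))
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        remaining -= push
    holds = {}
    for (u, v), w in weights.items():
        used = abs(w - cap[u][v])  # net flow routed across this link
        if used > 0:
            holds[(u, v)] = used
    for link, used in holds.items():
        weights[link] -= used
    return holds


# Hypothetical weights; A buys a $10 good from D.
weights = {("A", "B"): 5, ("B", "D"): 4, ("A", "C"): 10, ("C", "D"): 9}
holds = place_hold(weights, "A", "D", 10)
held_into_d = holds.get(("B", "D"), 0) + holds.get(("C", "D"), 0)
print(held_into_d)  # 10
```

Because a link's weight is only ever lowered by the net flow routed across it, no weight can become negative, which is what stops a seller from reusing credit that is already on hold.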


4.3.2 Responding to feedback 


Finally, once the buyer provides feedback about the 
transaction, Bazaar makes changes to the risk network. 
These changes depend on the feedback from the buyer: 


• Positive feedback. If the buyer reports a successful
transaction, indicated by positive feedback,
Bazaar restores the temporarily lowered weight and
additionally creates a new link directly between
the buyer and seller weighted by the transaction
amount.⁴ This has the effect of both restoring the
network to its previous state, and creating a new
risk link between the buyer and seller. The intuition
for this action follows from the discussion above,
whereby the buyer and seller are more likely to enter
into a future transaction together.


• Neutral feedback. If the buyer reports a partially
successful transaction, indicated by neutral
feedback, Bazaar restores the temporarily lowered
weight, but does not create a new link. This has the
effect of restoring the network to its previous state,
with no changes. The intuition for this action is that
users who provide neutral feedback are not claiming
that the transaction was fraudulent, but are not
completely satisfied. Thus, the buyer is not likely
to enter into a future transaction with the seller, but
does not wish to punish the seller by providing negative
feedback.


• Negative feedback. If the buyer reports an unsuccessful
transaction, indicated by negative feedback,
Bazaar makes the temporary lowering of the
weights permanent and does not create any new
links. This has the effect of reducing weight on the
seller’s links, thereby decreasing the seller’s ability
to conduct transactions in the future without having
them flagged. In particular, if the seller conducts
many transactions that end up with negative feedback,
eventually, all of his links will be exhausted,
and he will be unable to conduct any non-flagged
transactions.


• No feedback. Finally, if the buyer does not report
feedback at all, a configurable timeout of T is used,
after which Bazaar responds as if the buyer provided
neutral feedback (i.e., the temporarily lowered
weight is restored, but no new link is created).
This is similar to existing sites, which often have a
time cutoff for providing feedback.


³If multiple path sets exist that have sufficient weight, Bazaar simply
picks one of these sets randomly.

⁴If a direct link already existed, then Bazaar simply increases that
link’s weight by the transaction amount.

Figure 2: State of the risk network while A conducts a $10 transaction with D. The state is shown (a) before the
transaction, (b) while waiting for feedback, (c) if the buyer reports negative feedback, (d) if the buyer reports
positive feedback, and (a) again, if the buyer reports neutral feedback or the timeout expires.

Returning to our running example in Figure 2, suppose
that the feedback is received or the timeout occurs.
Bazaar either makes the weight reductions permanent if 
the buyer reports negative feedback (Figure 2 (c)), restores
the previous weights and also forms a new A ↔ D
link if the buyer reports positive feedback (Figure 2 (d)), 
or restores the previous weights if the buyer reports neu- 
tral feedback or the timeout occurs (Figure 2 (a)). 
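
The four outcomes can be sketched as a single settlement routine. The function name `settle`, the dictionary layout, and the starting weights below are assumptions for illustration; a timeout is represented here by passing `None` as the feedback.

```python
def settle(weights, holds, buyer, seller, amount, feedback):
    """Apply buyer feedback to a risk network with credit on hold.

    `weights` maps a sorted (u, v) pair to the current (lowered) weight and
    `holds` maps the same keys to the amount held for this transaction.
    `feedback` is "positive", "neutral", "negative", or None for a timeout.
    """
    if feedback not in ("positive", "neutral", "negative", None):
        raise ValueError(f"unknown feedback: {feedback!r}")
    if feedback != "negative":
        # Positive, neutral, or timeout: release the held weight.
        for link, held in holds.items():
            weights[link] = weights.get(link, 0) + held
    if feedback == "positive":
        # Reward the shared risk with a direct buyer-seller link, or
        # strengthen that link if it already exists.
        direct = tuple(sorted((buyer, seller)))
        weights[direct] = weights.get(direct, 0) + amount
    # On negative feedback nothing is restored: the reduction is permanent.
    holds.clear()


def outcome(feedback):
    """Settle a hypothetical $10 A-to-D purchase and return the weights."""
    weights = {("A", "B"): 1, ("B", "D"): 0, ("A", "C"): 4, ("C", "D"): 3}
    holds = {("A", "B"): 4, ("B", "D"): 4, ("A", "C"): 6, ("C", "D"): 6}
    settle(weights, holds, "A", "D", 10, feedback)
    return weights


print(outcome("negative"))  # weights stay lowered
print(outcome("neutral"))   # weights restored
print(outcome("positive"))  # weights restored, plus an A-D link of 10
```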


The intuition for why Bazaar is able to prevent fraud is
demonstrated by the network shown in Figure 3, where a
malicious user X has created a number of identities (X_1
... X_5) and has conducted fictitious transactions between
them (in essence, the weight on these links can be arbitrarily
set by X). Without Bazaar, potential victim Z
would only see X_1’s fictitious feedback consisting of a
number of positive entries. Not knowing that all of this
positive feedback was from other identities owned by the
same underlying user, Z would likely be defrauded. With
Bazaar, however, the fictitious transactions do not contribute
to the max-flow between Z and X_1, and Bazaar
is likely to flag the transaction as potentially fraudulent
(even though Bazaar had no a priori knowledge that all
X_i identities belong to the same user). Moreover, should
X use one of these identities to conduct a fraud (of no
more than $5, since anything greater would be automatically
flagged as potentially fraudulent), the Y ↔ X_1
link will have credit put “on hold” and eventually reduced
(once the buyer provides negative feedback), regardless
of which identity X selects as the seller. This is
the case regardless of the number of identities X creates
or how he creates fictitious transactions between them.
In effect, Bazaar forces X to participate in successful
transactions with other non-malicious users in order to
increase his max-flow, and penalizes these links whenever
X conducts fraud.

Figure 3: Example risk network, showing why Bazaar
secures reputations (links represent previous real transactions,
and double links represent fictitious transactions).
Honest identity Z is considering entering into a transaction
with malicious identity X_1 (owned by the same
user as X_2 ... X_5). Without Bazaar, X_1 appears to be a
reputable seller. With Bazaar, the fictitious transactions
do not increase the max-flow ($5) between Z and X_1,
thereby preventing the reputation manipulation.
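
A small experiment makes the argument concrete. The topology below is an assumption loosely modeled on the description of Figure 3 (a real $5 link between honest identity Y and the malicious identity X1, with fictitious links among X1 to X5); the $20 Z-Y weight and the fictitious weights are invented, and `max_flow` is a textbook Edmonds-Karp sketch rather than Bazaar's own algorithm.

```python
from collections import deque


def max_flow(links, source, sink):
    """Edmonds-Karp max-flow; each undirected link becomes two arcs."""
    cap = {}
    for (u, v), w in links.items():
        cap.setdefault(u, {})[v] = w
        cap.setdefault(v, {})[u] = w
    if source == sink or source not in cap or sink not in cap:
        return 0
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        total += push


def sybil_network(fictitious_weight):
    """Real links Z-Y and Y-X1, plus a ring of fictitious links."""
    links = {("Z", "Y"): 20, ("Y", "X1"): 5}
    ring = ["X1", "X2", "X3", "X4", "X5"]
    for i, name in enumerate(ring):
        links[(name, ring[(i + 1) % len(ring)])] = fictitious_weight
    return links


# Inflating the fictitious links does not raise the max-flow from Z,
# whichever of X's identities acts as the seller.
for weight in (10, 1_000, 1_000_000):
    links = sybil_network(weight)
    print(weight, max_flow(links, "Z", "X1"), max_flow(links, "Z", "X3"))
# 10 5 5
# 1000 5 5
# 1000000 5 5
```

Every path from Z to any of X's identities must cross the single real Y-X1 link, so that link's weight caps the flow no matter how much fictitious weight sits behind it.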


4.3.3 Bootstrapping


New users, by definition, have no transaction history 
and therefore have a max-flow of 0 to all other users. 
To allow new users to participate without having all 
of their transactions flagged as potentially fraudulent, 
Bazaar uses two techniques. First, Bazaar allows users 
to create virtual links to their real-world friends (in the 
same manner as malicious users can create links in the
risk network between their identities by conducting fictitious
transactions). This mechanism allows users to obtain
a few “starter” links from their friends, without opening
a new security vulnerability: Since the user’s friends
are, in effect, vouching for the new user, the friends are
putting their existing links on the line. If the new user
defrauds others, not only would his links be penalized, 
but the links of his friends would be as well. 

Second, if the new user does not have any real-world 
friends in the marketplace, Bazaar allows him to option- 
ally provide the marketplace operator with an amount of 
money to hold in escrow. In return, the marketplace op- 
erator creates links between the new user’s identity and 
other, random identities with a total value of the amount 
in escrow. These newly created links allow the new user 
to participate in the marketplace. At some later time, 
the new user can request that the escrowed money be 
returned (and the marketplace operator will remove the 
created links). However, if the created links represent 
weight on hold, or if they have been lost (due to a
fraudulent transaction), the marketplace operator would 
refuse to return the escrowed money. This approach does 
not open up a new vector for attack, as (a) the most 
the new user could defraud is the amount of escrowed 
money, and (b) if the user does commit such a fraud, he 
would lose his escrowed money. In essence, such an at- 
tack would not allow a malicious user to gain any money. 


4.4 Guarantees


We now discuss the guarantees that Bazaar provides. In 
brief, Bazaar ensures that malicious users can only defraud
others up to the total amount of successful transactions
that they have participated in with non-malicious
users. To see this, let us imagine a malicious user
X, whose identity has outgoing links with weight totaling
a_X. Each time X conducts a fraudulent transaction,
some of his links are reduced, in aggregate, by the
amount that he defrauds. Thus, once X has defrauded
a total of a_X, all of his links have been removed and
he is prevented from participating in transactions in the 
future. Moreover, X cannot use the “window of oppor- 
tunity” (discussed in Section 3) to conduct fraud before 
feedback is provided, as Bazaar puts link weights on hold 
until the feedback is received. 
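
The argument in this paragraph can be written out as a short derivation. The notation is ours rather than the paper's: a_X is the total weight on X's links before any fraud, f_i is the value of X's i-th unflagged fraudulent transaction, and w_i is the weight left on X's links once that transaction's credit has been put on hold. The sketch assumes X completes no successful transactions in between, so no weight is added.

```latex
\begin{align*}
w_0 &= a_X, \\
w_i &= w_{i-1} - f_i, \qquad f_i \le w_{i-1}
      \quad \text{(otherwise the transaction exceeds the max-flow and is flagged)}, \\
\sum_{i=1}^{k} f_i &= w_0 - w_k \;\le\; a_X
      \quad \text{since } w_k \ge 0 .
\end{align*}
```

The bound holds whether or not feedback has arrived, because the weight is deducted when the hold is placed rather than when the negative feedback is received.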

Moreover, the same analysis holds for any subgraph 
or any cut in the network. Thus, collusion between ma- 
licious users does not help; the users can only defraud 
together for the total of what they could defraud separately.
This argument also explains why creating fake
identities does not help, as it is the cut in the network
between the user’s identities and the rest of the network
that bounds the amount that the user can defraud,
instead of the number of identities the user has or the 
amount of fictitious feedback. The upshot is that Bazaar 
does not explicitly detect Sybil nodes or malicious users 
in the network; rather, it provides a strict guarantee on
the amount of fraud that they are able to conduct. 

The implication of this analysis is that we can charac- 
terize the amount of fraud the malicious users are able 
to conduct, in aggregate. Let us partition the network 
in two groups: G, containing non-malicious identities 
who do not conduct fraudulent transactions, and /, con- 
taining malicious identities whose goal is to defraud oth- 
ers. Let us consider the cut in the network between these 
two sets, with total value c,yyq. We make two observa- 
tions: First, any links that lie along this cut must repre- 
sent non-fraudulent transactions between non-malicious 
users and malicious users; in essence, these represent in- 
stances where the malicious users were non-malicious. 
Second, any time one of the malicious users defrauds a 
non-malicious user, this cut is reduced by the amount of 
the fraud. Thus, malicious users can only defraud non- 
malicious users of up to cyyq before the two groups are 
partitioned and all of the malicious users’ transactions 
are flagged as potentially fraudulent to the non-malicious 
users. 


It is worth noting that this is a much stronger guar- 
antee than what can be provided today. For example, 
today, a user can potentially purchase a large amount of 
fictitious positive feedback with a low monetary investment,
use that feedback to appear as a non-malicious
seller, and then defraud users of a significant amount of
money. This problem is exacerbated by the fact that the
defrauded users have to realize that they have been defrauded
before they can provide negative feedback and
warn others, leaving a significant window of vulnerability.
Moreover, the malicious user can simply repeat this
process with a new identity. By putting this bound in 
place, we are able to force the malicious user to par- 
ticipate in valid transactions with non-malicious users, 
thereby significantly reducing the attractiveness of com- 
mitting such a fraud. 


4.5 Discussion 


We now discuss a few deployment issues with Bazaar. 


User interaction. The marketplace operator can use the
output of Bazaar in multiple ways. For example, the mar- 
ketplace operator can provide strong fraud guarantees by 
not allowing flagged transactions to go through. Alterna- 
tively, the marketplace operator can require that flagged 
transactions use an escrow service or insurance service,
or can more closely scrutinize the transaction. The lat- 
ter options represent an additional incentive for the mar- 
ketplace operator to deploy Bazaar, as selling additional 
services such as escrow or insurance may increase their 
revenue while at the same time attracting customers due 
to a decrease in fraud. 


Providing honest feedback. An additional concern is
whether buyers are incentivized to provide honest feed- 
back on transactions in Bazaar. First, rational buyers 
have no incentive to provide incorrect negative feedback: 
By doing so, they penalize their own links and they pre- 
vent the creation of a new link between themselves and 
the seller. Since having more links is desirable (as it 
allows a user to participate in more and higher-valued 
transactions), buyers are disincentivized from providing 
incorrect negative feedback. Second, rational buyers also 
have no incentive to provide incorrect positive feedback. 
In particular, if they were unhappy with the transaction, 
providing positive feedback creates a new direct link to 
the seller; this is likely to be highly undesirable if the 
buyer felt defrauded, as it risks the buyer’s existing links. 


Targeted attacks. Another possible concern is whether
Bazaar introduces a new attack vector by allowing a ma- 
licious user to conduct a targeted attack on a seller by 
purchasing their goods and then always providing nega- 
tive feedback (thereby damaging the seller’s reputation). 
First, such an attack is possible in existing marketplaces, 
as malicious users can conduct this attack by creating nu- 
merous free identities and then purchasing the victim’s 
goods. Thus, Bazaar does not open up a new avenue for 
attack. Second, we note that Bazaar raises the bar on this 
attack, making it more difficult to conduct: With today’s 
marketplaces, the malicious users can purchase the vic- 
tim’s goods immediately after creating another identity. 
With Bazaar, the malicious users must first conduct non- 
fraudulent transactions in order to obtain enough links 
to be able to conduct the attack, making such an attack
significantly more difficult and less attractive. 


Compromised accounts. If a user’s account password is
compromised, an attacker can conduct fraudulent trans- 
actions on the user’s behalf, eventually causing the user 
to run out of links. However, this attack is not unique 
to Bazaar, since attackers could conduct the same attack 
with the reputation systems in use today. Moreover, with
Bazaar, the amount of fraud that can be conducted is still 
subject to the Bazaar bounds, whereas without Bazaar, it 
is potentially unbounded. 


Protecting sellers. Bazaar, as described so far, focuses
on protecting buyers from being defrauded by malicious 
sellers who manipulate their reputation. However, in cer- 
tain marketplaces, it may be necessary to protect sellers 
as well (e.g., from buyers who use fraudulent payment 
mechanisms like stolen credit cards). We leave protect- 
ing sellers to future work, with one comment: The need 
to protect sellers is somewhat mitigated by the fact that 
marketplace operators generally allow sellers to verify 
payment before shipping the good. 


Maintaining full network knowledge. The design of
Bazaar proposed so far requires knowledge of the com- 
plete risk network. This is not an unreasonable assump- 
tion, as online marketplaces are generally run by a sin- 
gle operator that has full knowledge of all transactions. 
Given this information, the marketplace operator can cre- 
ate and update the risk network as necessary. It may be 
possible to decentralize knowledge of the risk network, 
but this remains an open research question and is a sub- 
ject of future work. A decentralized system has several 
advantages with regards to privacy and scalability, but 
as we do not know of any decentralized online market- 
places, the path to deploy a decentralized solution is un- 
clear. 


5 Calculating max-flow using multi-graphs 


The Bazaar design described so far relies on finding the 
max-flow between two nodes in order to calculate the 
amount of risk embedded in a potential transaction. 
Since the risk network may have a large number of nodes 
and links, finding the max-flow between nodes using 
traditional approaches like Ford-Fulkerson [8] and 
Goldberg-Rao [9] may prove to be expensive. Similarly, 
pre-computing max-flow values through techniques like 
Gomory-Hu trees [12] may also prove too costly, and is 
complicated by the fact that the risk network changes 
over time. Instead, Bazaar uses a novel approach called 
multi-graphs in order to reduce the computation 
required. In this section, we first describe useful 
observations about risk networks and our desired 
max-flow algorithm, then detail the multi-graph data 
structure, and finally demonstrate how multi-graphs 
reduce the complexity of finding max-flow values. 


5.1 Observations 


We begin by making two observations concerning the 
risk networks in online marketplaces and the properties 
of the max-flow calculation in Bazaar. 


1. Dense core First, like social networks [16], the risk 
networks we observe in real-world online market- 
places tend to have a dense core, meaning a small 
minority of users possess the majority of the links. 
Moreover, the higher-valued links (representing risk 
relationships with higher values) also tend to fall 
in this “core.” As a result, the risk network tends 
to shrink rapidly if links with less than a specified 
weight are discarded. We demonstrate this with 
real-world data in the following section. 


2. Actual max-flow not needed Second, and most im- 
portant, Bazaar does not need to actually calculate 
the value of the max-flow between a potential buyer 
and seller. Instead, Bazaar simply needs to verify 
whether the max-flow is above a certain value (i.e., 
the value of the potential transaction). This implies 
that the complexity of calculating the max-flow in 
Bazaar may not be as high as a general max-flow 
calculation. 


The multi-graph optimization, described next, leverages 
both of these observations in order to reduce the com- 
plexity of the max-flow calculation in Bazaar. 


5.2 Multi-graphs 


Formally, we define a multi-graph M to be a set of graphs 

$$M = \{G_0, G_1, \ldots, G_n\}$$

where each graph $G_i = (V_i, E_i)$. These graphs are 
related: First, $G_0$ is defined to be the entire risk 
network. Second, $G_i$ is defined to be the subgraph of 
$G_{i-1}$ with 

$$E_i = \{e \in E_{i-1} : w(e) \ge k^i\}$$
$$V_i = \{v : (v, \cdot) \in E_i\}$$

where w(e) represents the weight of edge e and k is a 
configurable system parameter with a suggested value of 
2. Thus, the multi-graph contains a series of risk 
networks, where each subsequent network is a subgraph of 
the previous one containing only those links with an 
exponentially higher weight. An example of converting a 
risk network into a multi-graph is shown in Figure 4. 








Figure 4: Conversion of a risk network (left) to a risk 
multi-graph (right). Links with higher weights are shown 
with thicker lines. Graphs at higher levels in the multi- 
graph only include links with exponentially increasing 
weights (e.g., with k = 2, the three levels of the multi- 
graph would represent all links, links with weight $2 and 
higher, and links with weight $4 and higher). 
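
The level construction defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the dictionary representation of links, the function names `build_multigraph` and `nodes`, and the choice to stop at the first empty level are all hypothetical, and V_i is read here as the set of endpoints of the links in E_i.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]        # a link (u, v); representation is an assumption
Graph = Dict[Edge, float]     # link -> weight


def build_multigraph(risk_network: Graph, k: float = 2.0) -> List[Graph]:
    """Return [G_0, G_1, ...], where G_i keeps links with weight >= k**i."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    levels = [dict(risk_network)]          # G_0 is the entire risk network
    i = 1
    while True:
        threshold = k ** i
        # G_i is a subgraph of G_{i-1}, so only the previous level is filtered.
        level = {e: w for e, w in levels[-1].items() if w >= threshold}
        if not level:                      # assumption: stop at the first empty level
            break
        levels.append(level)
        i += 1
    return levels


def nodes(level: Graph) -> Set[str]:
    """V_i, read as every node that is an endpoint of some link in E_i."""
    return {n for edge in level for n in edge}


if __name__ == "__main__":
    risk = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 5.0}
    for i, level in enumerate(build_multigraph(risk)):
        print(i, sorted(level))
```

With k = 2 the three links above land in three levels, mirroring the caption of Figure 4: all links, links of weight 2 and higher, and links of weight 4 and higher.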


Note that a multi-graph contains multiple copies of a 
given link, the weights of which need to be kept consis- 
tent. There are three operations on the risk network under 
which Bazaar must maintain consistency: 


• Link addition When a new link is added, it is simply 
added to all of the graphs to which it belongs 
(e.g., if the link weight is w, the link is added to 
$G_0$ and to every $G_i$ with $k^i \le w$). 


• Link weight change When the weight of a link 
is changed, it is simply added to or removed from 
the appropriate graphs. Conceptually, this can be 
viewed as removing the link from all graphs, 
followed by adding it back at its new value. 


• Link weight temporary adjustment Recall that 
Bazaar may temporarily lower the weight of a link 
when a transaction is in progress. Conceptually, this 
can be viewed as changing the weight of the link. 
Later, if the adjustment is undone, this can again be 
viewed as a weight change. 
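
All three operations reduce to "remove the link from every graph, then add it back at its new weight". A minimal sketch of that single primitive follows. The function name `set_link_weight`, the list-of-dictionaries layout, and the convention that a weight of `None` deletes the link are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]


def set_link_weight(levels: List[Graph], edge: Edge,
                    weight: Optional[float], k: float = 2.0) -> None:
    """Keep every copy of `edge` consistent across the levels G_0, G_1, ...

    Covers link addition, weight change, and temporary adjustment.
    Passing weight=None removes the link entirely (an assumed convention).
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    if not levels:
        levels.append({})
    # Step 1: remove the link from all graphs.
    for level in levels:
        level.pop(edge, None)
    # Step 2: add it back to G_0 and to every G_i with k**i <= weight.
    if weight is not None:
        levels[0][edge] = weight
        i = 1
        while k ** i <= weight:
            if i == len(levels):
                levels.append({})          # a new, higher level is needed
            levels[i][edge] = weight
            i += 1
    # Drop levels at the top that have become empty.
    while len(levels) > 1 and not levels[-1]:
        levels.pop()
```

A temporary adjustment is then two calls: one that lowers the weight when the transaction starts and one that restores it if the adjustment is undone.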


5.3 Max-flow on multi-graphs 


Now, let us consider what happens when Bazaar checks 
whether a path set of total weight w exists between a 
source and a destination. With a normal risk network, 
Bazaar must use an algorithm like Goldberg-Rao, which 
runs over the entire risk network and is optimized to 
determine the actual max-flow between the source and 
destination. In contrast, with a multi-graph, Bazaar 
proceeds by first finding the highest-weight network 
$G_m$ in which both the source and the destination are 
present. Then, Bazaar runs any existing max-flow 
algorithm on $G_m$, looking for a set of paths of 
collective weight w. If such a set is found, then the 
algorithm returns that set and is finished. If no such 
set is found, Bazaar repeats the process with the 
next-lowest graph $G_{m-1}$. This process continues 
until either a set of paths of weight w is found, or 
Bazaar cannot find such a set of paths in the lowest 
graph $G_0$. The latter case indicates that the max-flow 
in the original risk network is lower than w. Because 
$G_0$ is the original risk network and every higher 
graph is a subgraph of it, a query on the multi-graph 
is guaranteed to have the same outcome as a query on 
the original risk network. 
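
The query procedure above can be sketched as follows. This is a hedged illustration rather than Bazaar's code: it substitutes a plain Edmonds-Karp search (BFS augmenting paths) for "any existing max-flow algorithm", treats links as directed capacities, and the names `has_flow` and `multigraph_has_flow` are hypothetical. A level that lacks the source or destination simply fails fast, which has the same effect as skipping to the highest level where both are present.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]


def has_flow(graph: Graph, src: str, dst: str, target: float) -> bool:
    """Edmonds-Karp that stops as soon as `target` units have been routed."""
    residual: Dict[Edge, float] = {}
    adj: Dict[str, set] = {}
    for (u, v), w in graph.items():
        residual[(u, v)] = residual.get((u, v), 0.0) + w
        residual.setdefault((v, u), 0.0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if src not in adj or dst not in adj:
        return False                       # an endpoint is absent at this level
    flow = 0.0
    while flow < target:
        parent = {src: None}               # BFS tree over residual links
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return False                   # max-flow at this level is below target
        bottleneck = float("inf")
        v = dst
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = dst
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck
    return True


def multigraph_has_flow(levels: List[Graph], src: str, dst: str,
                        target: float) -> bool:
    """Try the sparsest (highest) level first and fall back towards G_0."""
    for level in reversed(levels):
        if has_flow(level, src, dst, target):
            return True
    return False
```

Because every higher level is a subgraph of G_0, a success at any level is also a success on the full risk network, and a failure is only reported after G_0 itself has been searched.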

It is worth noting that multi-graphs require an increase 
in storage costs, since multiple copies of many links must 
be stored. However, as we demonstrate in the evaluation, 
the storage requirements of the multi-graphs are modest 
and are easily met by today’s computing hardware. 


5.4 Benefit of multi-graphs 


We now describe how the use of multi-graphs speeds up 
the max-flow calculation in Bazaar. Consider the case of 
a transaction of value w. First, because of observation 
1 above, the sizes of the graphs $G_i$ decrease extremely 
rapidly as i increases. Thus, running a max-flow 
algorithm over $G_i$ is significantly faster than running 
it over $G_{i-1}$. Second, because of observation 2, it is 
possible to modify the max-flow algorithm to terminate 
as soon as it finds a path set of weight w, instead of 
continuing to find the actual max-flow. For example, if 
we are using Ford-Fulkerson, only a few rounds may be 
needed in order to find a set of paths of weight w. 
Third, the increasing link weights in higher $G_i$ 
further reduce the running time of the max-flow 
algorithm, as the path set in higher $G_i$ is likely to 
consist of only a few paths. As we demonstrate in the 
evaluation, these effects allow multi-graphs to 
significantly speed up the calculation in practice. 


6 Evaluation 


In this section, we present an evaluation of Bazaar. In 
particular, we use data collected from a real-world on- 
line marketplace to determine if the max-flow technique 
employed by Bazaar is able to detect and prevent fraud- 
ulent transactions. We describe the data collected, verify 
our observations in the previous section, demonstrate the 
performance gains of using multi-graphs, and present an 
evaluation of Bazaar on real-world data. 


6.1 Auction data 


In order to evaluate Bazaar, we collect data from eBay, 
the largest online marketplace. We focus on collect- 
ing data from the ebay.co.uk site, containing United 
Kingdom auctions. 

Category        Purchases    Users        Avg. Price (£)
Clothes         3,311,878    1,436,059     9.45
Collectibles      940,815      454,773     8.90
Computing         964,925      661,285    21.31
Electronics       861,108      652,350    20.67
Home/Garden     2,795,795    1,426,785    16.57
Total           8,874,521    3,168,455    14.12


Table 1: Distribution and monetary values of feedback 
seen in our trace. 


eBay makes the feedback for all users public. Each 
piece of feedback consists of the feedback value (posi- 
tive, negative, or neutral), the auction the feedback was 
for, the identity of the user providing feedback, and a 
short message from that user explaining the feedback. 
Feedback can be provided by both the buyer and seller, so 
each auction can result in two pieces of feedback. eBay 
only makes detailed feedback available for 90 days, after 
which time, information about the auction the feedback 
is for is removed, and only the feedback value, message, 
and providing user remain. Thus, we are only able to 
collect detailed feedback for the previous 90 days. 

eBay provides an API to collect data, but rate limits the 
requests to a very low rate. Instead, we use web scrap- 
ing to collect data. We start from one user and crawl 
their feedback profile. From this profile, we learn about 
other users and proceed to crawl them. We continue this 
process until we exhaust all known users, effectively per- 
forming a breadth-first-search of the feedback graph. 
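
The crawl order described above is an ordinary breadth-first search. The sketch below shows only the traversal; `fetch_partners` is a hypothetical callback standing in for whatever code downloads and parses one user's feedback profile, and nothing here reflects eBay's actual pages or API.

```python
from collections import deque
from typing import Callable, Iterable, Set


def crawl(seed: str,
          fetch_partners: Callable[[str], Iterable[str]]) -> Set[str]:
    """Breadth-first search of the feedback graph starting from one user.

    `fetch_partners(user)` is assumed to return the users that appear in
    `user`'s feedback profile.
    """
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        user = frontier.popleft()
        for other in fetch_partners(user):
            if other not in seen:          # crawl each user only once
                seen.add(other)
                frontier.append(other)
    return seen
```

The search stops when the frontier is empty, which corresponds to exhausting all known users; users outside the seed's connected part of the feedback graph are never discovered.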

In order to make our data collection process tractable, 
we only consider auctions and feedback that occur in five 
of the largest auction categories, shown in Table 1. Thus, 
we do not crawl other users that appear in the feedback 
history if the auction is not in one of these five categories. 
Since eBay allows users to participate in international 
transactions, not all users we discover are located in the 
United Kingdom. We restrict our crawl to only consider 
users located in the United Kingdom, leaving us with a 
total of 3,168,455 distinct users (note that users may 
participate in multiple categories). Finally, because Bazaar 
focuses on protecting buyers from malicious sellers, we 
only collect feedback from buyers to sellers (and ignore 
feedback from sellers to buyers). In total, our dataset 
contains information on 8,874,521 items of feedback. 


6.2 Dense core of risk networks 


We now turn to validate the observation in Section 5 that 
motivated our multi-graph design. Specifically, we 
examine whether there tends to be a dense “core” of users 
in the risk network, which is necessary for the 
multi-graph representation to have acceptable overhead. 
To do so, we use a similar approach to prior studies [16] 
and examine the subgraph consisting of highly weighted 
links. We are interested in both the size and the 
connectedness of these subgraphs. Figure 5 shows how 
these two attributes vary as only higher-weighted links 
are considered. As the threshold rises from £1 to £20, 
almost 80% of the links are discarded. However, the vast 
majority of the remaining nodes are still in the largest 
strongly connected component (SCC), indicating the 
presence of a strong core. For some of the categories, 
the largest SCC does not disintegrate until only links of 
over £100 are considered. This validates our observation 
from the previous section, and indicates that 
multi-graphs are likely to speed up Bazaar’s max-flow 
calculations in practice. 


Figure 5: Fraction of links remaining (bottom) and 
fraction of the remaining nodes in the largest SCC (top) 
as only higher-weighted links are considered. Even as the 
majority of links are discarded, the largest SCC still 
contains most nodes, indicating the presence of a core. 
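
The two quantities plotted in Figure 5 can be computed for any weight threshold as sketched below. This is an illustrative reconstruction, not the authors' analysis code: links are assumed to be directed (which is what makes strongly connected components meaningful), Kosaraju's algorithm is used as one standard way to find the largest SCC, and the names `largest_scc_size` and `core_stats` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def largest_scc_size(edges: List[Edge]) -> int:
    """Size of the largest strongly connected component (Kosaraju, iterative)."""
    fwd: Dict[str, List[str]] = {}
    rev: Dict[str, List[str]] = {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        fwd.setdefault(v, [])
        rev.setdefault(v, []).append(u)
        rev.setdefault(u, [])
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in fwd:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:                          # all successors explored
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing completion order.
    best, assigned = 0, set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        size, stack2 = 0, [start]
        while stack2:
            node = stack2.pop()
            size += 1
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack2.append(nxt)
        best = max(best, size)
    return best


def core_stats(links: Dict[Edge, float], threshold: float) -> Tuple[float, float]:
    """Return (fraction of links kept, fraction of remaining nodes in largest SCC)."""
    kept = [e for e, w in links.items() if w >= threshold]
    if not kept:
        return 0.0, 0.0
    remaining_nodes = {n for e in kept for n in e}
    return len(kept) / len(links), largest_scc_size(kept) / len(remaining_nodes)
```

Sweeping `threshold` over the price range and plotting both returned values would give curves of the same kind as the two panels of Figure 5.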


6.3 Multi-graph performance 


We now turn to evaluate the benefits of using the 
multi-graph representation on the performance of finding 
max-flow paths. Specifically, we examine the tradeoff 
between memory and speed; since multi-graphs store 
multiple copies of certain links, they naturally have 
higher memory requirements than only using a risk 
network. First, Table 2 shows the number of multi-graph 
levels and the resulting memory overhead, relative to the 
single graph, of storing a multi-graph in Bazaar. As can 
be seen from the table, while the multi-graph is 3 to 4 
times the size of the single graph, the absolute overhead 
is small. 

Next, we turn to evaluate the speedup of verifying 
whether a max-flow exists using a multi-graph in Bazaar. 
To do so, we create separate risk networks from each 
of the five categories by aggregating our feedback trace, 
creating links between users who participated in transac- 
tions with positive feedback. We then randomly select 





                Size               Overhead
Category        (MB)    Levels     Rel.      Abs. (MB)
Clothes         7.38    12         234.6%    17.3
Collectibles    2.01    14         221.0%    4.44
Computing       3.47    13         282.9%    9.83
Electronics     3.23    13         255.9%    8.25
Home/Garden     7.31    13         251.8%    18.4


Table 2: Memory requirements of a single graph repre- 
sentation of the risk network, and number of levels and 
overhead (both relative and absolute) of a multi-graph 
representation, with k = 2. 


1,000 pairs of nodes from each category and an amount 
from the prices in the observed auction trace. We 
calculate the time required to verify whether a set of 
paths with at least the selected auction amount exists 
between the pair of users. For this experiment, we used a 
machine with a 2.83 GHz Intel Xeon processor. 


Table 3 presents the results of this experiment. Using 
the multi-graph representation shows a significant 
performance gain, with speed-ups ranging between 1.92x 
and 2.86x. In fact, with the multi-graph, most of the 
max-flow calculations take less than 6 seconds to 
complete. However, most of the calculations that are 
successful (i.e., a set of paths is found with at least 
the specified weight) finish quickly, while the 
calculations that eventually fail (i.e., no such set is 
found) take much longer to finish, thereby inflating the 
average. This trend is expected since a failure must 
traverse every graph in the multi-graph, whereas a 
success has the potential to end early. This observation 
suggests a further avenue for speeding up the max-flow 
calculation in practice, by considering calculations that 
run longer than a specified amount of time to have 
failed. For example, in the Computing category, if all 
calculations that take longer than two seconds are 
considered to have failed, this would only misclassify 
5.5% of the calculations that would eventually succeed, 
and would lower the average running time from 1.66 to 
0.70 seconds. 


Regardless, even without this further optimization, the 
average max-flow calculations in the largest category we 
examine (Clothes) required 6.29 seconds, meaning that 
13,736 calculations could be completed per server per 
day. Using our trace, we determined that the highest 
number of auctions closing on a single day in this cat- 
egory was 80,846, meaning that Bazaar could be de- 
ployed in this category by purchasing a server with at 
least 6 cores. Of course, synchronization would need to 
be maintained to ensure that two cores were not using a 
single link at once. We observed, though, that such con- 
flicts occur rarely (0.0165% of the time in this category), 
implying that parallelism of the max-flow algorithm [1] 
is likely to provide significant performance gains. 
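
The capacity estimate in the paragraph above is simple arithmetic, reproduced here as a small check. The function name `cores_needed` is hypothetical, and the sketch assumes one calculation at a time per core with no synchronization cost.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def cores_needed(avg_seconds: float, peak_per_day: int):
    """Return (calculations per core per day, cores needed for the peak day)."""
    per_core = math.floor(SECONDS_PER_DAY / avg_seconds)
    return per_core, math.ceil(peak_per_day / per_core)


if __name__ == "__main__":
    # Clothes category: 6.29 s average, 80,846 auctions closing on the peak day.
    print(cores_needed(6.29, 80_846))
```

For the Clothes figures this yields 13,736 calculations per day and 6 cores, matching the numbers quoted in the text.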


NSDI '11: 8th USENIX Symposium on Networked Systems Design and Implementation 







                Time (s)
Category        Single    Multi-graph    Speedup
Clothes         18.0      6.29           2.86x
Collectibles    2.53      1.18           2.14x
Computing       3.78      1.66           2.21x
Electronics     2.71      1.41           1.92x
Home/Garden     11.6      5.34           2.17x


Table 3: Average max-flow calculation times, and rela- 
tive speedup when using multi-graphs with k = 2. 


6.4 Detecting fraud with Bazaar 


We now turn to examine how well Bazaar is able to de- 
tect fraudulent transactions. In particular, we are inter- 
ested in three aspects of Bazaar’s performance: First, 
what is the impact on non-malicious users? In other 
words, how often are non-malicious users’ transactions 
incorrectly flagged as potentially fraudulent? Second, is 
Bazaar able to bound the amount of fraud that malicious 
users are able to conduct? Third, what impact, in terms 
of the amount of fraud prevented, could we expect from 
Bazaar if it were deployed on an online marketplace? 

To conduct the evaluation, we use a random subset of 
80% of the feedback data to create a risk network for 
each of the five categories, and then use the remaining 
20% of the feedback data to simulate the operation of 
Bazaar. Because our data only represents a 90-day 
period, many of the users participate only in a single 
transaction (and therefore have a max-flow of 0 to all 
other users). In order to reduce the bias caused by our short 
time-window of data, we only simulate users who we ob- 
serve to participate in at least five transactions during the 
time range. Finally, for each data point, we repeat the 
experiment 10 times using different random seeds. 

To simulate Bazaar, we need a few pieces of informa- 
tion from each auction transaction: the identity of the 
buyer and seller, the price of the auction, the purchase 
and feedback time, and the feedback itself. Our crawled 
data unfortunately only contains the purchase time for 
54.6% of the data.⁵ So, for the auctions where the 
purchase time is not available, we artificially select a 
purchase time by subtracting a random “delay” from the 
feedback time. This delay is randomly drawn from the 
observed purchase-time-to-feedback-time delay distribu- 
tion of the other auctions. 
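
The imputation step described above amounts to sampling from the empirical delay distribution. A minimal sketch follows; the parallel-list data layout, the use of `None` for a missing purchase time, and the function name `impute_purchase_times` are assumptions made for illustration.

```python
import random
from typing import List, Optional


def impute_purchase_times(purchase: List[Optional[float]],
                          feedback: List[float],
                          rng: random.Random) -> List[float]:
    """Fill missing purchase times with feedback time minus a sampled delay.

    Delays are drawn uniformly from the purchase-to-feedback delays observed
    on the auctions whose purchase time is known.
    """
    observed = [f - p for p, f in zip(purchase, feedback) if p is not None]
    if not observed:
        raise ValueError("no auction has a known purchase time")
    return [p if p is not None else f - rng.choice(observed)
            for p, f in zip(purchase, feedback)]


if __name__ == "__main__":
    times = impute_purchase_times([10.0, None, 30.0], [15.0, 50.0, 32.0],
                                  random.Random(0))
    print(times)
```

Drawing from the observed delays, rather than from a fitted parametric model, keeps the imputed times consistent with whatever shape the real delay distribution has.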


6.4.1 Impact on non-malicious users 


Our first evaluation examines the potential negative im- 
pact that Bazaar has on non-malicious buyers and sellers. 
The primary form that such impact takes is incorrectly 


⁵In more detail, the purchase time of fixed-price auctions—where a 
user sells multiple, identical items at a fixed price—is not available, as 
these auctions have multiple buyers purchasing the items. 

Category        Fraction of transactions incorrectly flagged 
Clothes 1.11% 
Collectibles 1.12% 
Computing 3.25% 
Electronics 4.68% 
Home/Garden 2.43% 


Table 4: Fraction of non-fraudulent transactions that are 
incorrectly flagged as fraudulent by Bazaar. The fraction 
flagged incorrectly is never higher than 5%, indicating that 
non-malicious users are largely unaffected. 


flagging transactions as potentially fraudulent. To de- 
termine the frequency with which this happens, we sim- 
ulate Bazaar without any malicious users and calculate 
the fraction of transactions that had positive feedback 
but that would have been flagged by Bazaar due to in- 
sufficient max-flow. The results of this experiment are 
shown in Table 4, listing the fraction of non-fraudulent 
transactions which are flagged as potentially fraudulent 
by Bazaar. The results show that no more than 5% of all 
non-fraudulent transactions are flagged, indicating that 
non-malicious users in Bazaar are largely unaffected. 


6.4.2 Blocking malicious users 


We now evaluate whether Bazaar is able to bound the 
amount of fraud that malicious users can conduct in prac- 
tice. Recall that Bazaar guarantees that each user is only 
able to conduct fraudulent transactions up to the amount 
of non-fraudulent transactions that he has participated in. 
Thus, we are interested in comparing how much fraud 
malicious users can conduct, relative to the amount of 
non-fraudulent transactions they participated in. 

To simulate the behavior of malicious users, consistent 
with prior studies [22], we randomly select 1% of the 
users to be malicious. For each user, we simulate Bazaar 
running with other, randomly selected users purchasing 
items from the malicious user. We then calculate the total 
amount of fraudulent transactions that each user can con- 
duct, until the point at which Bazaar flags all transactions 
with the malicious user as potentially fraudulent. 

Figure 6 presents the results from conducting this ex- 
periment, by plotting the amount of fraudulent transac- 
tions a malicious user can conduct versus the sum of 
the malicious user’s initial links. As can clearly be seen 
in the figure, Bazaar’s bound on the amount of fraudu- 
lent transactions holds: the amount of possible fraud is 
strictly bounded by the sum of the non-fraudulent 
transactions that the malicious user has participated in so far.⁶ 


⁶A careful reader will note that malicious users are sometimes 
bounded to less than the actual total of their previous successful trans- 
actions. This occurs when, for example, a malicious user is the only 

Figure 6: Aggregate amount of fraudulent transactions 
that malicious users can conduct versus the aggregate 
value of previous successful transactions. Also included 
is the expected bound (y = x). As expected, Bazaar 
ensures that malicious users can only commit fraud up to 
the amount of successful transactions that they have 
participated in previously. 


Even if the malicious user whitewashes his account (by 
creating a new identity), or conducts a Sybil attack (by 
creating multiple identities and linking them by fictitious 
transactions), he is unable to conduct any more transac- 
tions that are not flagged as potentially fraudulent. 


6.4.3 Preventing fraud 


As a final point of evaluation, we examine the amount of 
fraud that Bazaar would prevent, were it to be deployed 
on a real-world online marketplace. In other words, what 
impact could we expect from Bazaar? 

To evaluate this, we use the same 90-day trace from the 
five eBay categories. Then, for each seller, we calculate 
the total amount of goods sold with positive feedback, 
and the total with negative feedback. Recall that Bazaar 
prevents any user from having more (price weighted) 
negative feedback than positive feedback, so the auctions 
that represent the excess negative feedback would have 
been flagged as potentially fraudulent. We therefore cal- 
culate the total of this excess, and determine what frac- 
tion of the overall negative feedback it represents. 
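
The excess calculation described above can be sketched as follows. This is a hedged illustration: the `(seller, price, is_positive)` tuple layout and the function name `flagged_excess` are hypothetical, and the fraction is taken over the price-weighted total of negative feedback, which is one reading of "fraction of the overall negative feedback".

```python
from collections import defaultdict
from typing import Iterable, Tuple


def flagged_excess(feedback: Iterable[Tuple[str, float, bool]]):
    """Return (total excess negative value, fraction of all negative value).

    For each seller, the excess is the amount by which price-weighted
    negative feedback exceeds price-weighted positive feedback.
    """
    positive = defaultdict(float)
    negative = defaultdict(float)
    for seller, price, is_positive in feedback:
        (positive if is_positive else negative)[seller] += price
    total_negative = sum(negative.values())
    excess = sum(max(0.0, negative[s] - positive[s]) for s in negative)
    fraction = excess / total_negative if total_negative else 0.0
    return excess, fraction


if __name__ == "__main__":
    trace = [("s1", 10.0, True), ("s1", 4.0, False),
             ("s2", 5.0, True), ("s2", 20.0, False)]
    print(flagged_excess(trace))
```

In this toy trace only the second seller has more negative than positive value, so only that seller contributes to the flagged total.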

Table 5 presents the results. Bazaar would have 
flagged between 29.9% and 42.6% of all auctions that 
resulted in negative feedback as being potentially 
fraudulent, thereby possibly preventing these auctions 
from occurring. While we cannot say that all of these 
transactions represent fraud (e.g., the negative feedback 
could simply represent buyer’s remorse), the fact that 
these all come from sellers whose weighted negative 
feedback is greater than their weighted positive feedback 
strongly suggests so. In total, the auctions that Bazaar 
would have prevented represent £164,791.55 worth of 
goods, significantly bolstering the reliability of the 
online marketplace. Moreover, this amount is only for a 
90-day period in the five categories we study; the amount 
is likely to be significantly higher if Bazaar were 
deployed on the entire marketplace and over a longer 
period of time. 


user that another user is linked to: Even though the malicious user’s 
total is increased, this link does not increase the max-flow to any other 
users (much in the manner of the X2...X5 identities in Figure 3). 


                                  Fraction of all
Category        Total flagged     negative feedback
Clothes         £28,291.34        29.9%
Collectibles      4,995.04        38.2%
Computing        48,742.66        39.7%
Electronics      34,476.87        42.6%
Home/Garden      47,285.64        32.4%
Total           £164,791.55       36.0%


Table 5: Total value of auctions with negative feedback 
that would be flagged as potentially fraudulent, 
and the fraction of all auctions with negative feedback 
that this represents. Overall, Bazaar would have flagged 
£164,791.55 worth of auctions that eventually resulted in 
negative feedback, representing 36% of all such auctions. 


7 Related work 


Researchers have previously studied approaches to de- 
tecting auction fraud, usually relying on machine- 
learning techniques [4, 18] based on bidding behavior. 
While these techniques succeed at detecting some fraud- 
ulent users, they rely on characteristics of malicious be- 
havior. As a result, unlike Bazaar, these approaches do 
not provide a bound on the amount of fraud any user can 
conduct. Additionally, researchers have developed tech- 
niques [14, 21] to detect shill bidding, where users con- 
spire with others to artificially inflate the selling price of 
their auctions. Bazaar is complementary to this work, as 
it is not concerned with shill bidding, but rather, fraud 
caused by reputation manipulation. 

Other work [5, 10] has examined building reputations 
based on social relationships between users. While some 
of the techniques used are similar to Bazaar, Bazaar must 
determine pairs of trusting users itself (instead of as- 
suming pairwise trust is externally provided). This in- 
troduces significant challenges, but enables Bazaar to be 
deployed on existing sites. 

There is also significant work that studies the network 
formed by users who trust each other, and a number of 
research systems have already been proposed to lever- 
age this trust. Perhaps the most well known of these are 
the PGP web of trust [27] and the Advogato trust 
metric [2]. However, these systems are generally concerned 
with providing a stronger notion of identity, instead of 
bounding the amount of malicious activity. 

More generally, recent work has focused on detecting 
Sybil accounts using social networks [6, 25, 26]. These 
approaches are not directly applicable to online 
marketplaces for two reasons: First, they assume the 
existence of a social network that is not necessarily 
present, and second, they only bound the number of Sybil 
accounts that are admitted, not the amount of fraud that 
malicious users can conduct. Thus, even with Sybil 
detection algorithms, malicious users are still able to 
conspire to arbitrarily inflate each others’ reputations. 

Like other work [22], Bazaar uses a mechanism that 
is loosely based on the one used in Ostra [17], a system 
that uses a social network to block senders of unwanted 
communication. However, Bazaar differs from Ostra in 
three important ways. First, while Ostra is based on a rel- 
atively stable, unweighted social network, Bazaar uses a 
weighted risk network that is changing with every 
transaction (e.g., links are added and removed, and the 
link weights can grow and shrink over time). Second, 
Ostra assumes the trust network is given from an external 
source, while Bazaar constructs the risk network dur- 
ing the operation of the system. This requires Bazaar 
to face additional challenges, as malicious users are able 
to create links by participating in transactions (this is not 
possible in Ostra, as Ostra’s assumption is simply that 
links to non-malicious users take effort to form and main- 
tain). Third, Bazaar works by calculating the max-flow 
in the risk network, instead of simply finding a single 
path (as in Ostra). This induces significant engineering 
challenges and results in a system with a different set of 
guarantees. 


8 Conclusion 


In this paper, we presented Bazaar, a system that 
strengthens user reputations in online marketplaces. 
Bazaar is based on max-flow calculations over a risk 
network, a data structure that encodes the amount of 
rewarded shared risk between participants. Using data 
on over 8 million purchases from a real-world online 
marketplace, we demonstrated that Bazaar is able to ef- 
fectively bound the fraud that malicious users are able 
to conduct, while only rarely impacting the transactions 
conducted between non-malicious users. 

Given the popularity of online marketplaces and the 
large amount of fraud that such marketplaces currently 
experience, our hope is that Bazaar can be used as a drop- 
in component on real-world sites. Bazaar is designed to 
be readily applied to such marketplaces. 
Abstract 


Computational RFID (CRFID) tags embed sensing and 
computation into the physical world. The operation of 
the tags is limited by the RF energy that can be harvested 
from a nearby power source. We present a CRFID run- 
time, Dewdrop, that makes effective use of the harvested 
energy. Dewdrop treats iterative tasks as a scheduling 
problem to balance task demands with available energy, 
both of which vary over time. It adapts the start time 
of the next task iteration to consistently run well over a 
range of distances between tags and a power source, for 
different numbers of tags in the vicinity, and for light 
and heavy tasks. We have implemented Dewdrop on top 
of the WISP CRFID tag. Our experiments show that, 
compared to normal WISP operation, Dewdrop doubles 
the operating range for heavy tasks and significantly in- 
creases the task rate for tags receiving the least energy, 
all without decreasing the rate in other situations. Using 
offline testing, we find that Dewdrop runs tasks at better 
than 90% of the best rate possible. 


1 Introduction 


Computational RFID (CRFID) tags are an emerging 
technology in which sensing and computational abilities 
are added to traditional RFID tags. Passive UHF RFID 
tags run and transmit an identifier using energy gathered 
from the transmissions of nearby RFID readers; they are 
very small and have no battery or long-term energy store. 
This ability makes them widely useful in commercial set- 
tings to, for example, automate interactions with pass- 
ports and drivers licenses, identify animals, and track re- 
tail goods in manufacturing and supply chains. The ad- 
dition of sensing and computation with CRFIDs enables 
a broader range of sensing applications, including cold- 
chain monitoring, access control, embedded monitoring 
of bridges and planes, gestural interfaces, activity recog- 
nition, and non-intrusive physiological monitoring [2]. 
These and other applications depend on very small, long- 
lived nodes that can be deeply embedded into the physi- 
cal environment in ways that go beyond sensor nodes and 
approach the original vision of “smart dust” [28]. 

The research agenda associated with CRFIDs is now 
becoming defined as the community uses prototype tags 
to experiment with applications [3, 6, 9]. A fundamental 
problem for these devices is the efficient use of energy. 


Energy is the scarce resource that limits the amount of 
computation that can be performed because it must be 
harvested at low rates from signals transmitted by readers 
meters away. Further, to remain physically small and to 
power up quickly, CRFIDs have minuscule energy stores 
compared to sensor network nodes. For example, the 
energy store of the WISP [24] prototype tag is eight 
orders of magnitude smaller than the battery of the 
popular Telos sensor mote [18]. This means that CRFIDs will 
typically exhaust and recharge their energy stores many 
times a second. In turn, it means that runtimes for sensor 
networks are of little use for CRFIDs. Sensor node run- 
times seek to keep long-term expenditures below long- 
term harvesting or to maximize node lifetimes measured 
in days [14]. In contrast, CRFID runtimes must take a 
short-term view to match lifetimes measured in millisec- 
onds. 


The problem we tackle in this paper is how CRFID 
tags can make efficient use of the available energy. The 
naive RFID power model on which CRFIDs are based 
is for the tag to turn on and run whenever it is powered 
by the reader. This approach works for traditional RFID 
tags because tag functionality is very simple (a state ma- 
chine with memory) and can be run in the worst case 
at the limit of the energy harvesting range. However, 
CRFIDs consume more energy when they run more 
complicated tasks that use sensors and computation. By 
adopting the model of running whenever there is power, 
current CRFID designs reduce the range at which a CRFID 
tag functions and limit the kinds of tasks that can be run. 
Prior work has looked at tuning the CRFID hardware 
constants (e.g., capacitor sizes) to better match available 
energy to a specific task [8]. Instead, our approach is 
to view the need to match harvested energy to task con- 
sumption as a scheduling problem. We wake the tag out 
of deep sleep only when it is likely to execute a task ef- 
ficiently. This enables devices to run a range of tasks 
efficiently without requiring hardware modifications.


We present the design and evaluation of Dewdrop, an 
energy-aware runtime for CRFID tags. We have imple- 
mented Dewdrop on the Intel WISP tag, and have exper- 
imented by powering the tags using a commodity Impinj 
UHF RFID reader for a range of distances, number of 
competing tags, and light and heavy CRFID tasks. By 
waking tags at the right times, we find that we can run 
tasks where they previously could not run, and about as 
often as possible given the energy that the RF environment
provides. Prior to our work, the WISP had an operating
range sufficient for point demonstrations. With our
runtime, it is possible to use a single RFID reader to track 
CRFID tags on everyday objects in a room with enough 
responsiveness for activity inference. 

While Dewdrop is conceptually simple, we found a 
practical design difficult to achieve for several reasons. 
First, the energy needed to run a task and the input RF 
power both vary greatly over time due to factors such 
as non-deterministic protocols and reader frequency hop- 
ping. This hampers predictions of when to start the next 
task execution. Second, our intuition about energy stor- 
age as a simple reservoir proved wrong because a fixed 
amount of energy is more or less expensive to store de- 
pending on when it is gathered, and the rate at which 
it is consumed depends on when it is spent. This leads 
us to track other forms of waste. Finally, it is costly to 
gather the basic information needed to make scheduling 
decisions because CRFIDs are so energy impoverished. 
This required opportunistic measurement strategies and 
careful implementation. 

We make three contributions. First, we formulate the 
task scheduling problem for CRFID tags with limited en- 
ergy storage. Second, we present the design of a runtime 
that enables CRFID tags to adapt their behavior to best 
match task energy requirements to available energy over 
the factors that most affect efficiency. Third, we show by 
experimentation with the WISP tag and an Impinj RFID 
reader that our design is much more effective than prior 
techniques for real energy costs and RF conditions. Dew- 
drop doubles the operating range for heavyweight tasks 
as compared to the WISP hardware that runs tasks when- 
ever there is power, and keeps overhead low to match 
the performance for lightweight tasks to which the WISP 
hardware is well suited. 

The rest of this paper is organized as follows. We 
start with background in Section 2 and then define the 
task scheduling problem for CRFIDs in Section 3. We 
present the design of Dewdrop and its implementation in 
Sections 4 and 5. Our experimental evaluation is in Sec- 
tion 6. We follow with related work in Section 7 and 
conclude in Section 8. 


2 Background 


We begin with relevant background on computational 
RFID because it is an emerging research area. 


CRFID tags and the WISP. CRFID tags combine RFID 
technology for energy harvesting and backscatter com- 
munication with computation and sensing. The proto- 
type CRFID tag that we use is the Intel Wireless Identi- 
fication and Sensing Platform (WISP) [24]. Other pro- 
totype CRFID tags exist [21, 30], but the WISP is the 
most widely used because it is available to the academic 
community.¹

Figure 1: Gen 2 tag, Intel WISP, Telos mote.

Figure 1 shows the WISP in comparison to a Gen 2 
UHF RFID tag and a Telos mote. Like an RFID tag, it is
small, thin, and battery-free. It runs only when powered 
by energy harvested from an EPC Gen 2 RFID reader 
and communicates with the reader using a low-energy 
form of signaling called backscatter. The current WISP 
can harvest sufficient power to operate at up to 4m. As 
advances in processor and sensor technology continue to 
reduce power consumption, the range of WISP tags will 
increase accordingly. 

Like a very low-end mote, the WISP is fully pro- 
grammable, capable of running small programs, and 
equipped with sensors. The WISP runs programs written 
in C on an ultra-low power 16-bit MSP430 microcon- 
troller and has 8K of flash memory, a 3D accelerometer, 
and temperature and light sensors. 

However, unlike an RFID tag, the WISP consumes 
considerably more power when computing, communicat- 
ing and sensing than can normally be harvested from the 
reader signal. Consequently, the WISP must duty cycle 
between a low-power sleep mode, in which the energy 
needed to run is gathered into a short-term energy buffer, 
and an active mode in which stored energy is consumed. 

We expect future CRFID tags to be more capable
than the WISP, but to remain very low-end devices,
even compared to sensor nodes. As the power efficiency
of the devices improves slowly over time, so too will
the sensing and processing demands that are placed on
them; thus, the disparity between harvestable power and
operating power will remain.


CRFID Applications. CRFID tags and readers are enablers
for ubiquitous computing applications that benefit
from instrumentation on or as part of objects in the physical
world. For example, the WISP has been used to prototype
applications for gesture-based access control [6],
cold chain monitoring [29], and activity recognition for
eldercare [3].

¹ See wisp.wikispaces.com for open-source WISP software
and hardware designs. WISPs are in use at more than 30 universities.


We delve into the last scenario to give one example of 
a workload that Dewdrop is intended to support. The au- 
tomatic recognition of the activities of elderly people can 
improve quality of life by helping elders remain in their 
own homes for longer with inexpensive care. It does this 
by tracking key indicators of well-being such as medi- 
cation adherence, mobility and exercise, food and water 
intake, changes in routine, and safety [17]. The use of 
CRFIDs for activity recognition can deliver a solution 
that is inexpensive and non-intrusive. CRFIDs with ac- 
celerometers can be affixed to objects in an elder’s home, 
and data gathered from the tags can be used to determine 
activity. This has advantages over existing solutions as 
it requires neither monitoring by cameras, which can in- 
vade privacy, nor on-body sensors, which can be incon- 
venient for elders. Additionally, this type of deployment 
would be difficult using motes because of their size and 
cost. 


In earlier work, we prototyped such a system by tag- 
ging objects an elderly person normally interacts with— 
her medicine cabinet, tea kettle, teacup, toothbrush, 
etc.—with CRFIDs with onboard accelerometers [3]. 
RFID readers were placed out of sight in the ceiling. 
Each CRFID repeatedly sampled its accelerometer and 
transmitted its value to the readers. The readers detected 
tags that moved by looking for changes in those values, 
for instance, when a CRFID-tagged medicine bottle is 
picked up. Activities such as preparing a meal and tak- 
ing medicine were then inferred from sequences of object 
use. 


We built our earlier system using WISPs and found 
that the system worked, albeit with a smaller coverage 
region and lower response rates than we expected. This 
meant that we needed to deploy multiple readers per 
room, and even then some tags responded infrequently, 
which degraded activity inference. After some investiga- 
tion, we determined that the WISPs were wasting much 
of the available energy. That discovery led to our work 
on Dewdrop. 


3 Problem 


Our goal is to run programs on CRFID tags in a way that 
makes the best use of the available energy, which in turn 
extends operational range and increases responsiveness. 
In this section, we formulate this goal as a scheduling 
problem and describe the key challenges. 
Figure 2: Example message exchange of a reader identifying
a tag.


3.1 Task Model 


In our setting, a reader powers one or more nearby tags 
and requests that they perform tasks. Tags may come 
and go from the range of a given reader as the RF envi- 
ronment changes or the tag or reader moves. In keeping 
with other CRFID and RFID applications, we assume 
that each CRFID tag repeatedly executes a single fixed 
operation as often as possible (e.g., reporting a sample), 
but from time to time may be retasked to perform a differ- 
ent operation (e.g., switch from sampling the accelerom- 
eter to measuring the light level). Additionally, tags in 
the deployment may be executing different tasks. As a 
tag considers only one type of task at a time, scheduling 
the order and execution of multiple tasks on a single tag 
is both unnecessary and out of scope. 

We define a task to mean a short program that is run 
to completion without pause. While it may be possi- 
ble to break some tasks into phases, the timing require- 
ments of the tag hardware, the RFID protocol, and ap- 
plication requirements make it impractical to interrupt 
many tasks once they start. Due to the operating con- 
straints of a tag, tasks are fairly inflexible and have lim- 
ited functionality. They can support modest processing, 
e.g., for lightweight encryption, but generally consist of 
sensing and reporting operations. Even with this limited 
task diversity, tasks have very different power require- 
ments. For example, measuring the light level consumes 
much less power than activating and sampling the ac- 
celerometer. We experiment with examples at the lower 
and higher ends of this spectrum later in the paper. 

We assume that CRFID tags will be powered by a 
standard Gen 2 RFID reader, at least in the near future. 
This is likely, as it allows CRFID tags to take advan- 
tage of deployed and commodity infrastructure. Tasks 
often return a result to the reader. Contention between 
the transmissions of multiple tags is managed by the EPC 
Gen 2 MAC protocol [7] that is based on Framed Slot- 
ted Aloha [25]. To gather tag IDs, the reader transmits 
a Query command that indicates the number of slots in 
the frame. Tags then randomly choose a slot in which to 
reply, and transmit a 16-bit random number in their slot. 
The reader ACKs this random number and the tag replies
with a 96-bit identifier. An example of this exchange is 
shown in Figure 2, where no tag chooses the first two 
slots, and one tag responds in the third slot. Tags that 
collide in a slot are not ACKed and respond again after 
the next Query. The reader iteratively modifies the frame 
size to best match the number of tags that are present. 
Sensor and other data is transferred on top of this pro- 
tocol, either by overloading the identifier bits or using 
further commands that read and write tag memory. New 
MAC protocols specially designed for CRFIDs are also 
of interest, but we leave them to future work. 
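
The slot-selection step described above can be sketched as a small simulation. This is a simplification and not the Gen 2 specification: the names `run_frame` and `inventory` are hypothetical, the RN16/ACK handshake is collapsed into a single step, and the frame size is held fixed, whereas a real reader adapts it between Query commands.

```python
import random
from collections import defaultdict


def run_frame(tag_ids, num_slots, rng):
    """Simulate one Framed Slotted Aloha frame.

    Each tag picks a slot uniformly at random. A slot holding exactly one
    tag yields an identified tag; tags sharing a slot collide and must
    retry after the next Query.
    """
    slots = defaultdict(list)
    for tag in tag_ids:
        slots[rng.randrange(num_slots)].append(tag)
    identified = [tags[0] for tags in slots.values() if len(tags) == 1]
    collided = [t for tags in slots.values() if len(tags) > 1 for t in tags]
    return identified, collided


def inventory(tag_ids, num_slots, rng, max_frames=1000):
    """Repeat frames until no tag is pending; return the number of frames used."""
    pending = list(tag_ids)
    frames = 0
    while pending and frames < max_frames:
        _, pending = run_frame(pending, num_slots, rng)
        frames += 1
    return frames
```

Because the number of frames a tag must sit through depends on random slot choices, the energy a tag spends on a reporting task varies from run to run even when its own code is fixed.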


3.2 Task Scheduling Goal 


Given that tags repetitively execute a task whenever pos- 
sible and the reader power is not controlled by the tags, 
maximizing energy efficiency is equivalent to maximiz- 
ing the rate at which tasks successfully complete. We 
use task completion rate, in terms of how many task it- 
erations succeed over a given time period, as a metric to 
evaluate the performance of Dewdrop in the steady state. 
Since energy falls off with distance (at least as quickly as 
distance squared), we expect the completion rate to fall 
with distance. But, it should not fall more quickly than 
the available energy. 

CRFID tags like the WISP collect the energy har- 
vested from RF signals into a capacitor that matches 
the fluctuating input power to the steady output power 
needed to run the tag. Energy is harvested whenever a 
nearby reader is transmitting an RF signal. Like an RFID 
tag, the WISP hardware begins task execution whenever 
a fixed, hardware-defined power level that is sufficient 
to activate the tag is reached. Once a task iteration has 
started, it may either run to completion or fail if the CRFID
tag runs out of energy first. We use this fixed, hardware
approach as a baseline for comparison in our evaluation.

Dewdrop replaces the fixed, hardware approach with 
an adaptive software strategy. There is only one decision 
that a tag can make to improve energy efficiency: to defer 
the start of a task it could otherwise begin, sleeping until 
the energy store becomes more full. This is useful be- 
cause the larger store of energy increases the chance that 
the task will run to completion. However, it is waste- 
ful in terms of time and energy if the task would have 
succeeded anyway. The runtime’s job is to decide when 
to run and when to sleep depending on the task and RF 
environment. 


3.3 Challenge: Varying Task Needs


A good runtime will not start a task unless there is 
(likely) sufficient energy to complete it, as failing a 
task consumes energy without doing useful work. Yet
whether a task will succeed is difficult to predict because 
task energy requirements vary greatly due to two main 
factors. 


Different size tasks. The energy consumption of differ- 
ent tasks can vary widely depending on the sensors they 
use, the computation they perform, and their communi- 
cation patterns. In our experiments, we consider a light 
task that simply takes an accelerometer reading, and a 
much heavier task that additionally uses the RFID com- 
munication protocol to send the accelerometer data to the 
reader by embedding it in the tag identifier. We refer to 
these as the SENSE and SENSETX tasks, respectively. 


Non-deterministic tasks. Tasks may be non- 
deterministic, which causes their energy requirements to 
vary from execution to execution. An important source 
of non-determinism is the RFID MAC protocol. The 
number of messages that a tag must process to commu- 
nicate with the reader depends on both the number of 
other tags present and the collisions that happen to oc- 
cur. As a consequence of the way the protocol works, a 
tag that chooses to take part in a communication round 
must complete the transaction; it cannot sleep or it will 
lose synchronization with the reader. Other sources of 
non-determinism may come from sensor data itself, the 
timing of reader queries (which a tag cannot control or 
predict) or random numbers used in security protocols. 


3.4 Challenge: Platform Inefficiencies 


The variation in task energy requirements suggests that a
better strategy might be to overestimate the task needs. 
For example, a tag could harvest energy until its buffer 
is completely full before executing a task. In this way, it 
would run with “a full tank” to avoid preventable failures 
and top off between tasks. Unfortunately, storing excess 
energy is wasteful due to platform characteristics. 


Sublinear charging. CRFIDs use capacitors for energy 
storage as they are well suited to energy harvesting de- 
vices [12]. They charge quickly, recharge indefinitely, 
are small and inexpensive, and are non-toxic. How- 
ever, capacitors store energy faster when they are close to 
empty than when nearly fully charged. This nonlinearity 
is fundamental to the way capacitors work. As the ca- 
pacitor voltage, which increases with increasing charge, 
approaches the voltage supplied by the energy harvesting 
circuitry, the charging current decreases to zero. Thus, to 
increase the task rate, it makes sense to operate with a 
lightly charged capacitor. 
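
The slowdown can be illustrated with the textbook model of an ideal RC circuit, in which the capacitor voltage follows V(t) = V_in (1 - e^(-t/RC)). The sketch below inverts that relation to obtain the time needed to reach a target voltage. The component values are hypothetical and chosen only for illustration; they are not measurements of the WISP, whose harvester is not an ideal fixed-voltage source.

```python
import math


def charge_time(v_target, v_in, r_ohms, c_farads):
    """Time for an initially empty capacitor to reach v_target (ideal RC model)."""
    if not 0 <= v_target < v_in:
        raise ValueError("target must be non-negative and below the source voltage")
    return -r_ohms * c_farads * math.log(1.0 - v_target / v_in)


# Hypothetical values, for illustration only.
R, C, V_IN = 10_000.0, 10e-6, 3.0

for v in (1.0, 2.0, 2.9):
    print(f"{v:.1f} V reached after {1000 * charge_time(v, V_IN, R, C):.1f} ms")
```

Under this model, each additional volt takes longer to gather than the previous one, and the time diverges as the target approaches the source voltage.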


Superlinear discharge. Regulating circuitry must ad- 
just the supplied (input or stored) voltage to the operat- 
ing voltage. Differences in voltage levels inevitably lead 
to some voltage-dependent conversion losses. For example,
the WISP uses a linear regulator that sheds the voltage
difference by dissipating heat, which wastes energy.
Other techniques are possible but come with their own 
tradeoffs (e.g., switching regulators are more efficient but 
have greater leakage, don’t work when the input voltage 
is near the target voltage, and are inefficient when they 
start up²). To minimize energy wasted while discharging,
the tag again should operate with its capacitor at a
minimal charge. 
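
As a rough illustration of why a fuller capacitor discharges less efficiently, an idealized linear regulator passes the load current straight through and drops the excess voltage, so the fraction of input power that reaches the load is approximately V_out / V_in. The sketch below applies that approximation; it ignores quiescent current and dropout, and the 1.8 V operating voltage is an assumed value for illustration.

```python
def linear_regulator_efficiency(v_in, v_out):
    """Fraction of input power delivered to the load by an ideal linear regulator."""
    if v_out <= 0 or v_in < v_out:
        raise ValueError("need 0 < v_out <= v_in")
    return v_out / v_in


V_OUT = 1.8  # assumed operating voltage, for illustration only

for v_cap in (2.0, 3.0, 5.0):
    eff = linear_regulator_efficiency(v_cap, V_OUT)
    print(f"capacitor at {v_cap:.1f} V: about {100 * eff:.0f}% of drawn power is useful")
```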

The exact inefficiencies will vary with the CRFID, but 
we believe that all real platforms will have these kinds of 
nonlinearities. The implication is that a quantum of en- 
ergy may cost (or be worth) a different amount depend- 
ing on when it is gathered (or spent), with excess energy 
being more wasteful. 


3.5 Challenge: Varying Input Power 


Even assuming that the tag runtime could accurately esti- 
mate tasks costs, it is difficult to know how long to sleep 
to store sufficient energy because the rate at which a tag 
harvests energy changes over time. 


Widely varying input powers. RF power received at a 
tag decreases at least as fast as the square of its distance 
from the reader. In practice, this means that the available 
energy varies by more than an order of magnitude over 
useful ranges. Hardwiring tags to operate at the low end 
of the power scale wastes a significant opportunity at the 
high end of the scale, and restricting tags to operate at 
the high end of the scale limits operational range. Ad- 
ditionally, CRFIDs harvest energy even when the task is 
being executed. When the tag is close to a reader less 
energy will be drained from the energy store than when 
further from the reader. Consequently, when close to the 
reader, less energy needs to be stored before execution 
can begin. 
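
The scale of this variation follows from simple arithmetic. The sketch below computes received power relative to a reference distance for a given path-loss exponent; an exponent of 2 is the free-space best case, and the function name and defaults are illustrative assumptions.

```python
def relative_power(distance_m, reference_m=1.0, exponent=2.0):
    """Received power at distance_m relative to the power at reference_m."""
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return (reference_m / distance_m) ** exponent


for d in (1.0, 2.0, 3.0, 4.0):
    print(f"{d:.0f} m: {relative_power(d):.4f} of the power at 1 m")
```

With an exponent of 2, a tag at 4m receives one sixteenth of the power available at 1m, and any larger exponent widens the gap further.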


Frequency selective fading. RFID systems operate in 
the 900MHz ISM band, so the reader must frequency 
hop every 400ms to obey FCC regulations. Multipath ef- 
fects result in different frequencies being attenuated dif- 
ferently. This means that the received power at tags can 
vary widely over short time scales. 


4 Design 


We now develop the design of our energy-aware runtime, 
Dewdrop. The main scheduling decision is when to start 
the next task iteration. Starting too soon wastes energy 
when the tag runs out of power and the task fails. Start- 
ing too late collects excess energy, which is inefficient 
to both store and use. Our approach is to minimize both
forms of waste. As we develop our design, we present
microbenchmarks using the WISP to show the importance
of the different factors we identified as challenges.

² This and other parts and design tradeoffs make the linear regulator
the best choice for the WISP.

Figure 3: Voltage drop for SENSETX (upper black items)
and SENSE (lower blue items).


4.1 Design Goals 


From our problem formulation, the overarching goal of 
Dewdrop is to convert all available energy into completed
task iterations. This goal is equivalent to two sub-goals 
that help to enable new applications: 


Increased range. We want our runtime to execute a 
task at greater distances from the reader than the base- 
line WISP hardware. Each task should work from next 
to the reader out to the distance at which the tag can no 
longer harvest enough energy for the task. 


Improved responsiveness. At all distances, we want to 
increase responsiveness compared to the baseline WISP 
hardware. We never want to noticeably decrease respon- 
siveness. 


Both goals are met by maximizing the task completion 
rate for a given task and distance from the reader. In 
practice, achieving them implies that we must meet two 
other goals: 


Low overhead. The implementation of Dewdrop must 
be extremely lightweight. Operations such as checking 
the level of the energy store or calculating sleep periods 
consume scarce energy. Even a modest amount of over- 
head can easily negate the benefits of scheduling tasks. 


Adaptation. Tags must operate well across a range of 
deployment scenarios. For example, they may be config- 
ured to run either heavy or lightweight tasks, and they 
must run their task efficiently both when near and far 
from a reader. Our performance sub-goals are stated 
across these factors, so Dewdrop must adapt to the en- 
vironment at runtime. 
4.2 Variation in Task Costs


To predict when to start a task, Dewdrop must estimate 
how much energy the task will need over and above the 
energy that will be harvested by the tag while it runs the 
task. This depends on the factors we previously identi- 
fied: the task itself, other tags competing for the medium, 
the distance from the reader and the frequency on which 
the reader is transmitting, and the amount of energy al- 
ready in the capacitor. All of these factors are fundamen- 
tal. However, they may differ in magnitude with implica- 
tions for system design. For example, if the energy needs 
depend mostly on the type of task, then each task could 
be profiled offline to characterize its fixed energy need. 

To understand how much these factors matter in 
practice, we ran an experiment with the SENSE and 
SENSETX tasks running on a WISP. For the WISP, the 
energy consumption of a task can be measured by the 
drop in the voltage of the capacitor that acts as a short-term
energy buffer³. Figure 3 shows this voltage drop as
a function of distance for the two tasks. Box plots show 
the distributions over at least 300 task executions at each 
distance. 
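
A voltage drop can be converted to energy with the standard capacitor relation E = ½CV². The sketch below does this conversion; the 10 µF capacitance and the example voltages are assumptions used only for illustration.

```python
def energy_used(v_start, v_end, c_farads):
    """Energy (joules) removed from a capacitor as it falls from v_start to v_end."""
    return 0.5 * c_farads * (v_start ** 2 - v_end ** 2)


C = 10e-6  # assumed capacitance, for illustration only

for v_start in (2.5, 4.0):
    e_uj = 1e6 * energy_used(v_start, v_start - 0.5, C)
    print(f"0.5 V drop starting at {v_start:.1f} V uses {e_uj:.2f} uJ")
```

Because energy grows with the square of voltage, the same voltage drop represents more energy when it starts from a higher voltage, so voltage drops measured at different starting levels are not directly comparable as energy.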

The SENSE task is deterministic. However, we see that 
the voltage drop is significantly larger when the tag is far 
from the reader than when it is close to the reader; it 
more than triples. This is because the input power from 
the reader varies by more than an order of magnitude. A 
second effect is that the variance is larger when the task 
is run close to the reader because the input power supple- 
ments stored energy and varies with the reader transmit 
frequency. At 1m this variance is approximately 0.3V
compared to 0.1V at 4m. 

Looking at the SENSETX task, the drop in voltage is 
almost three times larger than for SENSE. At 4m, the 
WISP cannot store sufficient energy to execute the task⁴.
The variation is also higher at all distances because this 
task is non-deterministic. Its energy consumption de- 
pends on randomization in the Gen 2 MAC protocol, and 
the variation would be even greater if there were multiple 
WISPs (which we study as part of our evaluation). 

These results imply that Dewdrop should adapt to both 
the task and the environment in which the tag is operat- 
ing. Any fixed energy target at which to start a task will 
be either too low, causing the tag to fail at a distance 
when it could still run, or too high, causing the tag to run 
tasks more infrequently than it is capable of sustaining. 
A second implication is that it is likely not feasible to 
accurately estimate the energy needs of a particular task 
execution due to inherent variation. Instead, Dewdrop
must adapt an estimate of energy needs that captures the
effects of the distribution.

³ The energy stored in a capacitor is calculated as $\frac{1}{2}CV^2$, where C
is the capacitance and V is the measured voltage.

⁴ To even run the task over a range of distances we needed to modify
the baseline WISP behavior.

Figure 4: WISP capacitor voltage over time


4.3 Minimizing Wasted Energy 


Sources of waste. Energy is wasted when the CRFID tag 
starts too early and fails to complete the task, or waits 
too long and inefficiently collects excess energy. How 
much energy is wasted in these cases depends on how 
CRFID tags convert reader energy into harvested energy 
and consume this energy. 

To gain some insight, we performed a simple exper- 
iment by charging a WISP without running any task. 
Figure 4 shows the voltage of the WISP capacitor as it 
charges at different distances. (The RF source powers on 
at approximately 200 ms.) This is the expected behavior. 
A capacitor’s charging rate decreases by a factor of e every
RC seconds, where R and C are the resistance and
capacitance of the RC circuit and e is the base of natural
logarithms, and asymptotically approaches zero as
the capacitor charges to the voltage of the power source.

This charging behavior has two implications. First, it 
shows the effects of distance. Far from the reader, the 
low received power limits the maximum energy that can 
be stored. At 4m the capacitor approaches only 2.75V, 
while at 1m it rises quickly to 5.8V (at which point an 
over-voltage protection circuit kicks in). This means 
that heavy tasks will not run as far from the reader as 
lightweight tasks no matter how long the tag sleeps. 

The second implication is that, even for a fixed input 
power, it is inefficient to charge to a higher voltage than 
necessary. Because the rate at which energy accumulates 
in a capacitor decreases exponentially as it charges, stor- 
ing excess energy wastes time. There is a penalty for 
charging too high and leaving spare energy in the capac- 
itor. In a sense, that leftover energy was “cheaper” to 
store. This effect is magnified by the linear regulator of 
the WISP, which consumes more power when there is a 
higher charge on the capacitor. 

To capture these factors, Dewdrop estimates waste in 
terms of time. This directly accounts for the energy consumed
by a task, even if it fails, and also for how long it
took to store that energy. While the details will differ, all 
platforms are likely to have nonlinearities with respect 
to storing and consuming energy that make it useful to 
measure waste in terms of time. For instance, capaci- 
tors are the natural choice for short-term energy storage, 
and all CRFIDs that use capacitors will have this kind of 
inefficiency. 


Balancing sources of waste. Intuitively, starting tasks 
later, at a higher energy level, will decrease the time 
wasted due to tasks failing but increase the time wasted 
due to excess charging. Our goal is to minimize the total 
wasted time due to both causes. Since the energy cost of 
executing a task cannot be estimated precisely, Dewdrop 
aims to reduce the expected wasted time in the following
manner. Let $P(\mathrm{fail} \mid V_s)$ be the probability that the
task will fail given a starting voltage level $V_s$. The runtime’s
job is to choose a $V_s$ in the range $[V_0, V_{max}]$ that
minimizes the wasted time:

$$t_{waste}(V_s) = P(\mathrm{fail} \mid V_s)\, t_{under} + \bigl(1 - P(\mathrm{fail} \mid V_s)\bigr)\, t_{over}$$

where $t_{under}$ is the time to charge back to $V_s$ after a failure
and $t_{over}$ is the time spent overcharging, i.e., the time
spent charging beyond the energy level that would have
been sufficient. Note that this implies that some rate of
failures may be desirable as charging high enough to assure
success incurs a penalty that accumulates on every
execution.
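
The expression can be evaluated directly for a handful of candidate starting voltages. In the sketch below, the failure probabilities and times are invented purely to illustrate the tradeoff; they are not measurements, and `expected_waste` is a hypothetical helper name.

```python
def expected_waste(p_fail, t_under, t_over):
    """Expected wasted time per attempt: P(fail) * t_under + (1 - P(fail)) * t_over."""
    return p_fail * t_under + (1.0 - p_fail) * t_over


# Hypothetical candidates: V_s -> (P(fail | V_s), t_under in ms, t_over in ms).
candidates = {
    2.0: (0.6, 20.0, 0.0),
    2.5: (0.2, 30.0, 5.0),
    3.0: (0.0, 45.0, 25.0),
}

waste = {v_s: expected_waste(*params) for v_s, params in candidates.items()}
best = min(waste, key=waste.get)
print(waste, "-> best V_s:", best)
```

In this made-up table the middle setting wins even though one attempt in five fails, which matches the observation that a nonzero failure rate can be preferable to always charging high enough to guarantee success.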

A naive approach to finding the $V_s$ that minimizes
wasted time would be to try every value of $V_s$. This is
impractical, as the tag would need to examine a sufficiently
long series of task execution attempts at each $V_s$
to determine which had the best performance. Furthermore,
this search would need to be repeated periodically
as the RF environment and other factors change.

To avoid this search, we use our intuition that the two
kinds of wasted time trade off against each other to find an
approximate solution. Let $P_f$ be the current task failure
rate at a fixed starting voltage $V_s$, and let $T_{under} = P_f \cdot t_{under}$
and $T_{over} = (1 - P_f) \cdot t_{over}$. If $T_{over} \gg T_{under}$
then the runtime is too conservative; it could have chosen
a lower $V_s$. If $T_{under} \gg T_{over}$ then it is being too
aggressive; $V_s$ is too low and tasks are failing too often.

Dewdrop uses the heuristic that balancing the two
sources of waste tends to minimize overall wasted time;
this at least finds a reasonable operating point by ensuring
that neither factor is a major source of inefficiency.
Additionally, tracking and comparing the two sources
of wasted time requires minimal computation which is
key for any viable solution. The balance point can be
found by slowly updating $V_s$ to trade $T_{under}$ against
$T_{over}$. To do this, Dewdrop maintains separate estimates
of $T_{under}$ and $T_{over}$ that are updated with an exponentially
weighted moving average (with parameter $\alpha$) each
time a task executes depending on its success or failure.
The two estimates are then compared, and the energy
level $V_s$ is adjusted by $\beta$ in the direction that will balance
the averages. That is, it is increased if more time is
being wasted on failures than on charging too high.

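More precisely, let $V_e$ be the voltage at the end of running
a task, $V_0$ be the voltage at which the tag ceases
to operate, and $\epsilon$ be a small voltage. A task succeeds if
and only if $V_e > V_0 + \epsilon$. Dewdrop computes estimates
and uses them to adjust the target energy level, $V_s$, as follows:

$$T_{over} = \begin{cases} (1-\alpha)\,T_{over} + \alpha\, t_{over}, & \text{if } V_e > V_0 + \epsilon \\ (1-\alpha)\,T_{over}, & \text{if } V_e \le V_0 + \epsilon \end{cases}$$

$$T_{under} = \begin{cases} (1-\alpha)\,T_{under}, & \text{if } V_e > V_0 + \epsilon \\ (1-\alpha)\,T_{under} + \alpha\, t_{under}, & \text{if } V_e \le V_0 + \epsilon \end{cases}$$

$$V_s = \begin{cases} V_s - \beta, & \text{if } T_{over} > T_{under} \\ V_s + \beta, & \text{if } T_{under} > T_{over} \end{cases}$$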
Of course, there are degenerate cases where this
heuristic will fail, e.g., tasks that exhibit bimodal energy
consumption where some executions consume a lot of
energy and some executions consume very little. But,
based on applications we have seen in the literature, our
approach is a good fit and has the benefit of being both
simple and efficient.
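
The balancing rule of this section can be written as one update applied after each task attempt. The following is a minimal sketch under stated assumptions, not the WISP firmware: the class and parameter names are hypothetical, floating point is used for readability, clamping $V_s$ to $[V_0, V_{max}]$ is an added safeguard, and how $t_{over}$ and $t_{under}$ are measured is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class WasteBalancer:
    """Tracks the two waste averages and the wake-up voltage target V_s."""
    v_s: float
    t_over_avg: float = 0.0   # T_over in the text
    t_under_avg: float = 0.0  # T_under in the text


def update(state, v_e, t_over, t_under, v_0, eps, alpha, beta, v_max):
    """Apply one post-task update; return True if the task succeeded."""
    success = v_e > v_0 + eps
    # Both averages decay every time; only one receives a new sample.
    state.t_over_avg *= (1 - alpha)
    state.t_under_avg *= (1 - alpha)
    if success:
        state.t_over_avg += alpha * t_over
    else:
        state.t_under_avg += alpha * t_under
    # Step V_s toward the side that wastes less time.
    if state.t_over_avg > state.t_under_avg:
        state.v_s = max(v_0, state.v_s - beta)      # too conservative
    elif state.t_under_avg > state.t_over_avg:
        state.v_s = min(v_max, state.v_s + beta)    # too aggressive
    return success
```

For example, starting from `WasteBalancer(v_s=2.5)`, a failed attempt raises the target by `beta`, and a later success with a large overcharge time lowers it again.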


4.4 Charging to a Target Energy Level 


Given a target energy level, the CRFID runtime must ar- 
range for the task to begin execution when stored energy 
reaches that target. The baseline WISP uses hardware 
support in the form of a voltage supervisor to start exe- 
cution when the capacitor voltage reaches a fixed level of 
2V. Unfortunately, there are no designs for variable volt- 
age supervisors that can be used in CRFIDs to the best of 
our knowledge. 

Instead, Dewdrop uses a software polling approach to 
determine when the target energy level has been reached 
and execution should begin. It sleeps while energy is 
being harvested, and occasionally wakes up to sample 
the capacitor voltage using an analog to digital converter 
(ADC). This is a general strategy that can be used on 
most platforms regardless of how the target energy level 
is determined. 

However, polling is difficult to achieve at low cost be- 
cause charge times can vary over orders of magnitude 
and waking up and sampling the capacitor consume precious
energy. In our experiments with the WISP, we
found that reaching a given threshold can take less than
10ms or 100s of ms depending on the input power. This
variation, combined with the non-trivial cost of waking
up to take a sample, means that polling at any fixed inter- 
val is problematic. If the tag is close to the reader, a long 
interval means that the tag will store excess energy and 
miss opportunities to execute tasks. Conversely, if the 
tag is far from the reader, it will accumulate energy very 
gradually and pay a disproportionately greater overhead 
if the interval is short. 

To gather energy over a large range of input powers
and target voltages, Dewdrop uses an exponentially
adapted polling interval. Specifically, let $V_\delta$ be the voltage
a tag has gained since it last woke up, and $t$ be the
current sleep interval. Then,

$$
t_{\text{next}} =
\begin{cases}
2t, & \text{if } V_t - V > 2V_\delta \\
t/2, & \text{if } V_t - V < V_\delta/2 \\
t, & \text{otherwise.}
\end{cases}
$$


This mechanism is very lightweight because it only 
involves shift operations to scale the polling interval, not 
multiply, divide, or floating point operations (which are 
not likely to be available in hardware). In our evaluation 
we find it to be responsive, sleeping for short amounts 
of time at high input power, and to have low overhead, 
gathering energy out to low input power levels. 
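
The update rule can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the WISP firmware (which is written in C and assembly): voltages are integer millivolts, the interval is in timer ticks, `v_target` and `v_now` are our reading of the target and current capacitor voltages, and the `t_min`/`t_max` clamps are hypothetical additions. Only shifts and comparisons are used, mirroring the point that no multiply or divide is needed.

```python
def next_sleep_interval(t, v_now, v_target, v_gained, t_min=1, t_max=1 << 15):
    """Return the next sleep interval in timer ticks.

    t        -- current sleep interval (ticks)
    v_now    -- capacitor voltage just sampled (mV)
    v_target -- voltage at which the task should start (mV)
    v_gained -- voltage gained during the last sleep (mV)
    t_min, t_max -- hypothetical clamps, not taken from the paper
    """
    remaining = v_target - v_now
    if remaining > (v_gained << 1):
        # More than two intervals' worth of charge is still needed: sleep longer.
        t <<= 1
    elif (remaining << 1) < v_gained:
        # Less than half an interval's worth is needed: sleep shorter.
        t >>= 1
    # Otherwise keep the current interval.
    return max(t_min, min(t, t_max))


if __name__ == "__main__":
    print(next_sleep_interval(8, 1500, 2000, 100))  # far from target
    print(next_sleep_interval(8, 1980, 2000, 100))  # close to target
```

Doubling and halving let the interval track charge times that span orders of magnitude in a number of steps that is logarithmic in that span.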


5 Implementation 


The WISP firmware is written in a mix of C and assem- 
bly, for timing sensitive operations. The code can be 
broken down into two main components: the Dewdrop 
runtime and task support. The Dewdrop runtime code 
must execute quickly and infrequently to reduce over- 
head. Task support includes the Gen 2 RFID communi- 
cation protocol, which requires tags to respond to reader 
commands quickly, generally within 10s of microsec- 
onds. This section describes our implementation of a 
functioning prototype as it relates to these challenges. 


5.1 WISP Hardware 


The WISP draws approximately 600µA when the CPU
is in active mode and 1.5µA when in a state-preserving
sleep mode. By default, the WISP wakes up at a fixed
power level; a voltage supervisor waits for sufficient
power to operate (defined by its capacitor reaching 2V)
and then triggers a hardware interrupt to wake the device.
We use the term HwFixed to refer to this hardware
method of waking up at a fixed voltage. Dewdrop disables
this mechanism and instead uses a timer interrupt
to wake the device.

The WISP stores energy in a 10µF capacitor and the
voltage of the capacitor can be sampled via its analog to
digital converter.⁵ If the voltage of the capacitor drops
below 1.5V, the WISP will black out and lose all state. 
We found that the time to fully charge the capacitor var- 
ied from 10s to 100s of milliseconds, depending on dis- 
tance. Discharging a full capacitor to below 1.5V in the 
absence of a reader signal takes 10s of ms when active, 
but more than 8s when in sleep mode. Thus, the WISP 
can carry state across relatively long periods of reader 
inactivity by sleeping. 


5.2 Dewdrop 


Low power wake-up. Dewdrop puts the WISP into a 
deep sleep state for a specified period to gather energy, 
and the CPU is woken up by the timer interrupt. The 
process is repeated until the target wake-up voltage, V_t,
is reached. This approximates the behavior of a hardware
voltage supervisor, which wakes a device when a
specified voltage is reached, but allows us to vary V_t. A
potential drawback to this approach is an increased current
draw due to keeping the crystal oscillator active to
drive the timer, but in practice this increase is acceptably
small (2µA vs 1.5µA with the crystal off).


Low cost voltage sampling. Dewdrop checks the capac- 
itor voltage to see if enough energy has been stored to 
warrant starting a task, and goes back to sleep if not. The 
energy overhead of this polling approach is determined 
by the polling interval and how long the WISP must be 
awake for each sample. The per sample cost is directly 
proportional to how long the WISP must stay in active 
mode. Sampling the capacitor voltage should take 90µs
according to the MSP430 data sheet instructions for using
the ADC. However, we found that ADC values stabilized
much faster (20µs including setup time) with sufficient
accuracy (10mV). This shorter awake time drastically
reduced the cost of voltage sampling.
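
To put the saving in perspective, the per-sample cost can be estimated from figures given in Section 5.1. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the active-mode current stays at 600µA for the whole awake time, ignores any extra current drawn by the ADC itself, and ignores energy harvested while sampling.

```python
ACTIVE_CURRENT_A = 600e-6  # active-mode draw quoted in Section 5.1
CAPACITANCE_F = 10e-6      # storage capacitor quoted in Section 5.1


def sample_cost(awake_s):
    """Return (charge drawn in coulombs, capacitor voltage droop in volts)."""
    charge_c = ACTIVE_CURRENT_A * awake_s  # Q = I * t
    droop_v = charge_c / CAPACITANCE_F     # dV = Q / C
    return charge_c, droop_v


if __name__ == "__main__":
    for label, awake_s in (("data sheet", 90e-6), ("measured", 20e-6)):
        charge_c, droop_v = sample_cost(awake_s)
        print(f"{label}: {charge_c * 1e9:.1f} nC, {droop_v * 1e3:.2f} mV droop")
```

Under these assumptions the shorter awake time cuts the charge per sample by a factor of 4.5 (90µs / 20µs).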


Calculating the energy storage rate. Dewdrop also 
tracks how quickly energy is being stored, as it uses this 
information to adapt the sleep period and to calculate 
how much time is wasted overcharging. Our adaptive 
sleep function generally results in a series of sleep peri- 
ods, where the WISP wakes up and checks its voltage, 
adjusts the sleep period, and returns to sleep. When a 
task completes, V. — V, tells us how much energy is 
leftover. We use the last period’s charging rate and the 
average charging rate over all periods to estimate how 
much time was wasted overcharging. When a task fails,
V_t − V_0 tells us how much energy was wasted. We use
the average charging rate to calculate the time wasted undercharging.

⁵A 10µF capacitor is a reasonable trade-off between charge time (a
smaller capacitor charges faster) and charge capacity.


5.3 Task Support


Order of operations. The computation and sensing 
components of tasks must take place before or after com- 
municating with the reader; the deadlines imposed by the 
Gen 2 protocol are too tight to interleave task processing 
and message handling. Therefore, in the SENSETX task, 
for example, the WISP samples the sensor immediately 
after waking up and then begins decoding reader com- 
mands and waiting for the next Query. 


Detecting task failures. To avoid blacking out and los- 
ing state, the WISP needs to detect when task failures are 
imminent and then quickly enter sleep mode. In other
words, if the voltage drops to within a margin ε of the
1.5V blackout voltage (see Section 4), the task must be
aborted. In future hardware revisions of
the WISP, we would like to trigger an interrupt when a
minimum voltage threshold is reached. In the meantime,
we approximate this behavior by manually inserting calls
to the voltage sampling function in the task code. We
found that an ε of 0.15V was sufficient to protect against
blackout. That is, if any voltage sample measures below
1.65V, the WISP will sleep and record a task failure.
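
The manual checks can be pictured as a guard placed between steps of a task. The following Python sketch is only an illustration of that pattern: `run_task`, `should_abort`, and the step/reading callbacks are hypothetical names, and the real firmware performs the check in C at hand-picked points in the task code.

```python
BLACKOUT_MV = 1500  # below this voltage the WISP loses all state (Section 5.1)
EPSILON_MV = 150    # safety margin reported as sufficient in the text


def should_abort(sample_mv):
    """True when a voltage sample is too close to blackout to continue."""
    return sample_mv < BLACKOUT_MV + EPSILON_MV


def run_task(steps, read_voltage_mv):
    """Run task steps in order, sampling the voltage before each one.

    Returns (completed, steps_done). On abort the caller would record a
    task failure and put the device to sleep.
    """
    done = 0
    for step in steps:
        if should_abort(read_voltage_mv()):
            return False, done
        step()
        done += 1
    return True, done


if __name__ == "__main__":
    readings = iter([2000, 1800, 1600])
    log = []
    steps = [lambda: log.append("sense"),
             lambda: log.append("compute"),
             lambda: log.append("transmit")]
    print(run_task(steps, lambda: next(readings)), log)
```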


Sampling the voltage during the communication phase 
proved difficult, but it was necessary because message 
processing is a major factor in energy consumption. The 
Gen 2 message timing constraints are such that the WISP 
does not have time to take a sample between messages 
without losing synchronization with the reader, even with 
a sampling time of only 20µs. However, we found that
we could carefully schedule a voltage sample during the 
preamble of every reader command, so long as the in- 
spection of the sample was deferred until after the com- 
mand was decoded. As the WISP must be in active mode 
to accurately track the preamble, this approach amortizes 
the cost of keeping the CPU active for decoding. This 
strategy makes it possible for us to closely track the volt- 
age of the capacitor at every reader command with es- 
sentially zero overhead. 


Randomness. The Gen 2 MAC protocol requires that 
tags choose slots randomly. As a source of randomness, 
we sample the voltage in the capacitor once immedi- 
ately when the WISP first powers up, and use this value 
as a seed for a pseudo-random number generator. The 
variance in this voltage sample, due to input power and 
noise in the ADC, gives us sufficient randomness. Alter- 
natively, we could have used SRAM state as a random 
source, with similar efficiency [11]. 
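
As an illustration of seeding from a voltage sample, the sketch below folds an ADC reading into a seed and uses it to pick a slot among the 2^Q slots of a Gen 2 inventory round. The text does not say which pseudo-random generator the firmware uses; the 16-bit Galois LFSR here (toggle mask 0xB400, a commonly used configuration) is our assumption, as are all function names.

```python
def seed_from_adc(adc_sample):
    """Fold an ADC reading into a non-zero 16-bit seed (an LFSR must not start at 0)."""
    return (adc_sample & 0xFFFF) or 1


def lfsr16_step(state):
    """Advance a 16-bit Galois LFSR with toggle mask 0xB400 by one step."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def choose_slot(state, q):
    """Pick a slot in [0, 2**q - 1] from the low q bits of the generator state."""
    return state & ((1 << q) - 1)


if __name__ == "__main__":
    state = seed_from_adc(0x02A7)  # hypothetical ADC reading at power-up
    for _ in range(3):
        state = lfsr16_step(state)
        print(choose_slot(state, q=4))
```

A shift-and-XOR generator suits this platform for the same reason as the adaptive sleep interval: it needs no multiply or divide.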


5.4 Monitoring Support 


Monitoring WISP state and operation for debugging and 
experimentation is difficult. Traditional methods for de- 
bugging embedded systems, such as a JTAG connection, 
would supply power to the WISP and change its behav- 
ior. Instead, we use a custom monitoring board we devel- 
oped for debugging WISPs [19]. The board communi- 
cates with a PC via USB, attaches to the debug and other 
output pins of the WISP, but does not add to or consume 
energy harvested by the WISP. The monitor board can 
also sample the voltage in the WISP’s capacitor. For our 
study, we instrument the WISP to toggle debug pins at 
key points in its operation, and the monitor board records 
what event happened and immediately samples the WISP 
capacitor to determine its voltage. This results in a trace 
of WISP operations from which we can determine task 
costs, and response rates even for tasks that do not com- 
municate with the reader. 


6 Evaluation 


In this section, we evaluate Dewdrop experimentally. We 
show that our approach of balancing sources of waste 
generally achieves 90% of the best possible response rate 
for the SENSETX and SENSE tasks and across a wide 
range of RF environments. Dewdrop improves perfor- 
mance over the default WISP runtime, providing appli- 
cations a benefit in terms of both improved coverage and 
higher response rates. 


6.1 Experimental Setup 


Our experiments were conducted using an Impinj Speed- 
way RFID reader that continuously transmits energy and 
commands. This is the normal reader behavior. For ex- 
periments involving a single tag, the WISP was placed on 
a poster board 1m from the reader antenna and the out- 
put power was variably attenuated from 30dBm (1 Watt), 
the maximum allowed for “Gen 2” readers, to 18dBm. 
This method increases repeatability by limiting the mul- 
tipath effects that would occur if we moved the WISPs. 
We present results in terms of an equivalent distance that 
is calculated using free-space propagation, as we find 
them to be more intuitive than results in terms of transmit 
power. 
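
The conversion from transmit power to equivalent distance can be sketched as follows. This assumes ideal free-space propagation, in which received power falls with the square of distance, so attenuating the transmitter by A dB at the 1m reference position is treated as moving the tag to 10^(A/20) metres. The function name and defaults are ours.

```python
def equivalent_distance_m(tx_dbm, ref_dbm=30.0, ref_distance_m=1.0):
    """Distance at which a full-power reader would deliver the same power.

    Free-space path loss grows by 20*log10(d) dB, so an attenuation of
    A dB at the reference distance maps to d = d_ref * 10**(A / 20).
    """
    attenuation_db = ref_dbm - tx_dbm
    return ref_distance_m * 10 ** (attenuation_db / 20.0)


if __name__ == "__main__":
    for tx_dbm in (30, 24, 18):
        print(tx_dbm, "dBm ->", round(equivalent_distance_m(tx_dbm), 2), "m")
```

Under this model the 30dBm to 18dBm attenuation range corresponds to roughly 1m to 4m.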

In all experiments, we ran Dewdrop and the default 
WISP hardware, which we call HwFixed, that starts tasks 
at a fixed energy level of 2.0V. HwFixed provides a base- 
line for comparison. When possible, we also report re- 
sults for Oracle as the best result found from an exhaus- 
tive offline search of starting energy levels (at which the 
WISP wakes up and starts a task) using 0.03V steps. We
report results for both the SENSE and SENSETX tasks
described in Section 4.2.

Figure 5: Response rates when using Dewdrop and the
HwFixed runtimes.

Figure 6: Response rates for Dewdrop and HwFixed
compared to an oracle.

To evaluate our approach in a realistic deployment, 
complete with multipath effects, we deployed 11 WISPs 
with accelerometers on a 1.2m × 0.75m table of a model
apartment at Intel Labs Seattle. This deployment is sim- 
ilar to that seen in [3], though we only consider a single 
workspace instead of the complete apartment. An RFID 
reader was installed in the ceiling and equipped with 
one antenna approximately 2m above the table point- 
ing downwards. We configured the reader to run the 
SENSETX task to gather samples continuously for one 
minute. We performed three separate trials for each con- 
figuration to allow for variability from both the RF envi- 
ronment and communication protocol. 


6.2 Using Energy More Effectively 


Dewdrop performance. We first assess how well Dew- 
drop performs compared to HwFixed for a single WISP. 

Figure 5 compares the response rate of SENSE and 
SENSETX when using the two runtimes. We find that 
the performance of Dewdrop consistently matches or exceeds
that of HwFixed. For the light SENSE task, the performance
of Dewdrop closely matches that of HwFixed
and is actually better at 1m. This is because, at
close range, the received power supplements stored energy
enough to allow an energy level 0.2V below HwFixed’s
fixed value.

Figure 8: Response rate and wasted time for SENSE and
SENSETX at 3m.


In the case of the heavier SENSETX task, Dewdrop’s 
response rate decreases smoothly as reader power falls
to the equivalent of 3.5m. HwFixed fails to execute the task beyond 1.5m.
Dewdrop adapts to the higher energy requirements of this 
task, and stores more energy before beginning execution, 
whereas HwFixed does not. This improvement more than 
doubles the operating range of the tag. 

To find an upper bound on how well Dewdrop could 
work, we compare to the Oracle results. Gathering this 
test data takes hours and is thus not a candidate for a 
practical CRFID runtime. Figure 6 again shows the re- 
sponse rates for the two tasks when using HwFixed and 
Dewdrop, but the rates are normalized by the best rates 
found using the Oracle. We find that Dewdrop generally 
achieves better than 90% of the maximum rate seen by 
Oracle for both tasks. Interestingly, Oracle always beat 
HwFixed. This means that the fixed 2 V energy level was 
never the best choice. 


Evaluating Dewdrop’s choices. To understand why 
Dewdrop performs well, we looked at the starting energy
levels it selects. Dewdrop must choose starting energy
levels that are close to the best level found by the Oracle
if it is to be efficient. To show that this is a non-trivial
task, Figure 7 shows examples of response rate versus
energy level curves. The figure is based on data from the
Oracle for both tasks at 1.5 and 3m.

Figure 9: Charging time from 1.5V to 2V.

We see that the best starting energy level varies widely 
for different tasks and at different distances. For SENSE, 
the best energy level is 1.9V at 1.5m, when input power 
close to the reader supplements stored power, and 2.1V 
at 3m. Similarly, for SENSETX the best level varies from 
2.5 to 3V over the same distance. These results empha- 
size that no fixed threshold will work either for all tasks 
or for all distances. For example, the best energy level 
for SENSETX at 3m is 3V. This level achieves only 50% 
of the maximum response rate for SENSE at the same 
distance. It is even worse if the best level for SENSE at 
3m is chosen, as SENSETX cannot execute the task even 
once at 3m with an energy level of 2.1 V. 

The figure also shows the operating points found by 
Dewdrop marked with Xs. We see that our runtime finds 
points very close to the best energy level despite the dif- 
ferences between response curves. Across all of our data 
the energy levels found by Dewdrop were within 0.1 V of 
the best level found by Oracle. 

To see how Dewdrop selects a good starting energy 
level, we looked at how it minimizes wasted time. We 
calculated the average wasted time per task due to fail- 
ing and due to charging too high. Figure 8 shows this 
data, along with response rate, for an illustrative case of 
SENSE and SENSETX at 3m. The data are normalized 
by their maximum values. We see that as the starting 
energy level increases, the average wasted time due to 
failing generally decreases. (The waste is low at low 
wake-up thresholds despite tasks failing a greater frac- 
tion of attempts. This is because waste is computed in 
terms of time spent charging, and at low wake-up thresh- 
olds, very little time is spent charging.) Beyond 2.6V, 
waste from failed tasks decreases, as the task fails less 
often. Conversely, the wasted time from overcharging
increases with the starting energy level because the energy
is stored less efficiently at higher voltages.

Figure 10: Effect of step size on response rate for
SENSETX at 3.5m.

Dewdrop seeks the intersection of the two waste 
curves, and uses the corresponding energy level. This 
appears to be a good strategy as the maximum response 
rate in the figure occurs near the intersection. Moreover, 
since the rates plateau around the maximum, Dewdrop 
can miss its mark by a fairly wide margin (+0.1V), with- 
out affecting performance significantly. Though the fig- 
ure shows only a single example, we found the energy 
level that equalized the two sources of waste generally 
achieved better than 95% of the maximum rate for both 
tasks at all distances. 


Evaluating Dewdrop’s costs. This section investigates two possible inefficiencies
in Dewdrop: the cost of our timer-based adaptive sleep 
scheme, and the effect of our choice of step size for main- 
taining the starting energy level. We show that both are 
efficient, which is in keeping with our runtime perform- 
ing almost as well as the Oracle. 

To be effective, our runtime must not appreciably in- 
crease charging time. Figure 9 shows the median charg- 
ing time from 1.5V to 2V for Dewdrop’s adaptive sleep 
mechanism, the hardware wake-up of HwFixed, and 
two strawman versions of our software controlled sleep 
mechanism that use fixed sleep periods. 

We find that, at all distances, our adaptive scheme 
achieves a charge time within 5% of the charge time of 
the hardware mechanism. Moreover, as expected, its per- 
formance is good over a wider range of distances than 
schemes that do not adapt their sleep periods. For ex- 
ample, the fixed period of 100ms does well at 4m (1.3% 
longer than HwFixed), but performs poorly at close range 
(600% longer than HwFixed at 1m). Likewise, fixing the 
period at 10ms works well at close range, but incurs significant
overhead farther away (32% at 4m).

The second potential source of inefficiency in our system
comes from our choice of step size when seeking
the best starting energy level. In Dewdrop, upward pressure
on the level is only exerted after it drops fairly low
and tasks begin to fail; after failures, the starting energy
level rises until the cost of overcharging outweighs the
cost of failing. A small step size increases the time it takes to
adapt to environmental changes, while a larger step size can
result in large oscillations around the ideal wake-up threshold.

Figure 11: Percent of tags that have an average response
rate above 1/s and 5/s using the two runtimes.

Figure 10 shows the effect of different step sizes on 
task rate for SENSETX at 3.5m. The average task rate per 
second is calculated over a 10 second sliding window. As 
step size increases, the task rates generally decrease and 
vary more widely. A larger step size means that Dewdrop 
increases/decreases its starting energy level too quickly, 
resulting in significant over/undercharging. The reverse 
then happens and the voltage is reduced by too much and 
more tasks fail. We found that a step size of 0.01V gave 
a good balance between damping oscillations in energy 
level and quickly adapting to environmental changes. 
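
One way to picture how a fixed step size drives the starting energy level toward the point where the two sources of waste balance is the sketch below. It is a simplification for illustration, not Dewdrop's actual update rule (which is defined in Section 4): the function name, the waste inputs, and the floor and ceiling bounds are hypothetical, and voltages are integer millivolts.

```python
STEP_MV = 10  # the 0.01V step size the text reports as a good balance


def adjust_start_level(level_mv, waste_fail, waste_over,
                       step_mv=STEP_MV, floor_mv=1650, ceiling_mv=4000):
    """Nudge the starting energy level toward equal waste from both sources.

    waste_fail -- estimated time wasted on failed (undercharged) tasks
    waste_over -- estimated time wasted overcharging
    floor_mv, ceiling_mv -- hypothetical bounds, not taken from the paper
    """
    if waste_fail > waste_over:
        level_mv += step_mv   # failures dominate: store more energy first
    elif waste_over > waste_fail:
        level_mv -= step_mv   # overcharging dominates: start tasks sooner
    return max(floor_mv, min(level_mv, ceiling_mv))


if __name__ == "__main__":
    level = 2500
    for fail, over in [(5.0, 2.0), (5.0, 3.0), (1.0, 4.0)]:
        level = adjust_start_level(level, fail, over)
        print(level)
```

A larger `step_mv` reaches a new operating point in fewer updates but overshoots it by more, which is the oscillation the measurements above show.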


6.3 Multiple Tag Evaluation 


Next, we evaluate Dewdrop in a realistic deployment 
consisting of multiple tags. To support CRFID appli- 
cations such as activity recognition, our runtime should 
both increase the coverage region of the reader (e.g., so 
that distant devices respond) and also increase the re- 
sponse rates of the devices (e.g., so that object motion 
can more accurately be tracked). We consider both of 
these metrics for the 11 WISPs deployed in the model 
apartment. 


Coverage. The coverage goal is to have as many devices 
as possible responding at a useful rate. Based on prior 
experience, we define two useful rates: a rate of 1/s, as 
is useful for low-rate object use detection; and a rate of 
5/s, as is useful for higher-rate gestural recognition. To 
characterize the coverage of the deployment, the transmit 
power of the reader is reduced gradually to determine 
the “headroom” (in dBm) tags have for a given level of

We find that Dewdrop has much better coverage than
HwFixed because it enables tags to operate when much
less incoming power is available. Figure 11 shows the
percentage of tags with average response rates above 1/s
and 5/s when using the two runtimes. At 30dBm, all tags 
with Dewdrop respond at least once per second as com- 
pared to 64% with HwFixed. Coverage is better even 
when tags with Dewdrop receive one third the power of 
tags with HwFixed (viz., 67% for Dewdrop at 25dBm vs 
64% for HwFixed at 30dBm). Moreover, at a four-fold 
reduction in power (24dBm), 42% respond with Dew- 
drop while none respond with HwFixed. 
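
The power ratios quoted here follow directly from the definition of dBm. As a quick check (helper names are ours):

```python
def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10.0)


def power_ratio(a_dbm, b_dbm):
    """How many times more power a_dbm is than b_dbm."""
    return dbm_to_mw(a_dbm) / dbm_to_mw(b_dbm)


if __name__ == "__main__":
    print(dbm_to_mw(30))                   # 30dBm is 1 Watt
    print(round(power_ratio(30, 25), 2))   # about one third the power at 25dBm
    print(round(power_ratio(30, 24), 2))   # about one quarter the power at 24dBm
```

A 5dB reduction is a factor of about 3.2 and a 6dB reduction is a factor of about 4, matching the "one third" and "four-fold" descriptions.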

For a response rate of more than 5/s, the two runtimes 
perform equally well at 30dBm. This is because Hw- 
Fixed works well when a tag receives good power from 
the reader. However, HwFixed’s coverage decays much 
more quickly with power than does Dewdrop’s coverage, 
e.g., at 27dBm Dewdrop has three times the coverage of 
HwFixed. 


Response Rates. Figure 12 shows the distribution of the 
response rates of the tags when the reader is transmitting 
at 30 and 24dBm. The rates are computed over one sec- 
ond windows for both runtimes. We find that Dewdrop 
consistently achieves higher rates, especially for the tags 
receiving less energy; 30% of the data points are zero 
for HwFixed versus 5% for Dewdrop. Dewdrop’s abil- 
ity to achieve useful rates is even more apparent when 
the reader transmits at 24dBm and tags are receiving one 
fourth as much power. Dewdrop obtains response rates 
greater than once per second 30% of the time, as com- 
pared to 2% with HwFixed. At 30dBm, Dewdrop and 
HwFixed achieve nearly the same rates for those tags that 
receive the most energy; 25% of the data points are above 
9/s, and median rates are 5/s and 3/s respectively. 


⁶This “attenuation thresholding” technique [10] has been shown to
be more appropriate for characterizing RFID deployments than varying
distance due to the high sensitivity of RFID to multipath.

When more tags are present, the energy cost of communicating
with the reader increases. This is because the
reader increases the number of slots it uses to limit the 
likelihood of tag collisions, so CRFID tags must process 
more messages before transmitting to the reader. 

Figure 13 gives the performance for a single tag when 
the reader transmits at 30dBm as additional tags are 
added to the deployment. The performance of HwFixed 
rapidly decreases with the number of tags. This is be- 
cause the number of slots is increasing, and a tag cannot 
remain powered when it chooses a later slot. In contrast, 
Dewdrop simply increases its starting energy level to 
accommodate the additional communication overhead. 
With one tag, it wakes up around 2.5V whereas with 25 
tags it wakes up closer to 3V. The result is that Dewdrop 
provides nearly three times the response rate as HwFixed 
when 25 tags are present. 


7 Related Work 


There has been significant work on building energy har- 
vesting systems for sensor networks [27, 12, 1]. This 
work considers solar cells, but some conclusions ap- 
ply equally to CRFIDs, e.g., [12] finds that capacitors 
should be used as the primary buffer to tolerate rapid 
charge/discharge cycles. In [26, 13, 15], the schedul- 
ing problem for energy harvesting devices is considered. 
The scheduling problem for these systems differs significantly
from CRFIDs as they manage tasks and harvested
power on the order of days, attempt to extend lifetime to 
months, and have no penalty for storing excess energy. 
In contrast, Dewdrop must store sufficient energy for a 
single task execution, and tolerate input power variations 
on the order of milliseconds in a context where every op- 
eration consumes precious energy. 

Power management for CRFIDs has generally fallen
into two categories: supplying additional energy and
maintaining state information across power losses. Alternative
methods of powering devices have been explored [16],
with [5, 23] proposing solar cells and TV
transmitters for CRFIDs. These approaches provide 10s
of µW of supplemental power, an order of magnitude below
the requirements of current CRFIDs, so energy still
must be used efficiently.

In [20], the authors use offline profiling to estimate 
when state should be saved on the WISP, or transmitted 
to the reader [22], due to impending depletion of the en- 
ergy store. We found that simply entering low power 
sleep mode is an effective way to maintain state, and 
it avoids the cost of writing to flash or transmitting to 
the reader in scenarios where the reader does not power 
off for long periods of time. In [8] the authors use of- 
fline modeling to help determine the appropriate capaci- 
tor size for a device designed to execute a particular task. 
While hardware modifications are necessary for tasks 
with dramatically different energy requirements, Dew- 
drop enables a wider range of tasks to be executed ef- 
ficiently for any given energy store. 

The WISP has been used to demonstrate power inten- 
sive applications that would benefit from our approach. 
RC5 cryptographic primitives were implemented in [4], 
and both cryptography and sensors have been used to in- 
crease the security of implantable medical devices [9], 
and credit cards [6]. For these applications, the energy 
requirements were far beyond what could be provided at 
range, and the studies were done using the WISP at close 
range. Dewdrop aims to enable such applications to op- 
erate more effectively at greater range. 


8 Conclusion


We presented a runtime for CRFID tags that makes ef- 
ficient use of the scarce available energy. Our runtime, 
Dewdrop, adapts a tag’s duty cycle to match the har- 
vested power to the sensing and computation cost of 
tasks. To do this, it estimates the time wasted by over- 
charging and by underestimating task needs, and uses the 
result to choose how much energy to buffer before start- 
ing a task. Using an implementation built on the WISP 
tag and a commodity RFID reader, we showed that Dew- 
drop runs tasks where prior techniques could not, and 
runs them at better than 90% of the best rate found by 
offline testing across a range of input powers, competing 
tags, and light and heavy tasks. Dewdrop’s adaptation 
effectively doubled the distance at which a tag executes 
tasks, which enables practical deployments. In an instrumented
living space, all tags responded at a useful rate to a
single reader in the ceiling as compared to only 64% with 
fixed buffering. At over twice the distance (one quarter 
the transmission power), 42% of the tags still responded 
with Dewdrop while none responded with fixed buffer- 
ing. We believe these performance levels bring us close 
to realizing a wide range of realistic CRFID applications.

Abstract


We introduce SSDAlloc, a hybrid main memory manage- 
ment system that allows developers to treat solid-state 
disk (SSD) as an extension of the RAM in a system. 
SSDAlloc moves the SSD upward in the memory hier- 
archy, usable as a larger, slower form of RAM instead 
of just a cache for the hard drive. Using SSDAlloc, ap- 
plications can nearly transparently extend their memory 
footprints to hundreds of gigabytes and beyond without 
restructuring, well beyond the RAM capacities of most 
servers. Additionally, SSDAlloc can extract 90% of the
SSD’s raw performance while increasing the lifetime of
the SSD by up to 32 times. Other approaches either
require intrusive application changes or deliver only
6-30% of the SSD’s raw performance.


1 Introduction 


An increasing number of networked systems today rely 
on in-memory (DRAM) indexes, hashtables, caches and 
key-value storage systems for scaling the performance 
and reducing the pressure on their secondary storage de- 
vices. Unfortunately, the cost of DRAM increases dra- 
matically beyond 64GB per server, jumping from a few 
thousand dollars to tens of thousands of dollars fairly 
quickly; power requirements scale similarly, restricting 
applications with large workloads from obtaining high 
in-memory hit rates that are vital for high performance.

Flash memory can be leveraged (by augmenting
DRAM with flash-backed memory) to scale the performance
of such applications. Flash memory has a larger
capacity, lower cost, and lower power requirement than
DRAM, along with great random read performance,
which makes it well suited for building such applications.
Solid State Disks (SSDs) in the form of NAND
flash have become increasingly popular due to pricing.
256GB SSDs are currently around $700, and multiple
SSDs can be placed in one server. As a result, high-end
systems could easily augment their 64-128GB RAM
with 1-2TB of SSD.

Flash is currently being used as program memory via
two methods: by using flash as an operating system
(OS) swap layer or by building a custom object store on
top of flash. The swap layer, which works at a page granularity,
reduces performance and also undermines the
lifetime of flash for applications with many random accesses
(typical of the applications mentioned). For every
application object that is read/written (however small), an
entire page of flash is read/dirtied, leading to an unnecessary
increase in the read bandwidth and the number
of flash writes (which reduce the lifetime of flash memory).
Applications are often modified to obtain high performance
and good lifetime from flash memory by addressing
these issues. Such modifications not only need
deep application knowledge but also require expertise
with flash memory, hindering wide-scale adoption
of flash. It is, therefore, necessary to expose flash via
a swap-like interface (via virtual memory) while being
able to provide performance comparable to that of applications
redesigned to be flash-aware.

In this paper, we present SSDAlloc, a hybrid 
DRAM/flash memory manager and a runtime library 
that allows applications to fully utilize the potential of 
flash (large capacity, low cost, fast random reads and 
non-volatility) in a transparent manner. SSDAlloc ex- 
poses flash memory via the familiar page-based virtual 
memory manager interface, but internally, it works at an 
object granularity for obtaining high performance and for 
maximizing the lifetime of flash memory. SSDAlloc’s 
memory manager is compatible with the standard C pro- 
gramming paradigms and it works entirely via the virtual 
memory system. Unlike object databases, applications 
do not have to declare their intention to use data, nor do 
they have to perform indirections through custom han- 
dles. All data maintains its virtual memory address for 
its lifetime and can be accessed using standard pointers. 
Pointer swizzling or other fix-ups are not required. 

SSDAlloc’s memory allocator looks and feels much 
like the malloc memory manager. When malloc 
is directly replaced with SSDAlloc’s memory manager, 
flash is used as a fully log-structured page store. How- 
ever, when SSDAlloc is provided with the additional in- 
formation of the size of the application object being allo- 
cated, flash is managed as a log-structured object store. 
It utilizes the object size information to provide the ap- 
plications with benefits that are otherwise unavailable via 
existing transparent programming techniques. 

Using SSDAlloc, we have modified four systems built 
originally using malloc: memcached [4] (a key-value 
store), a Boost [1] based B+Tree index, a packet cache
backend (for accelerating network links using packet
level caching), and the HashCache [9] cache index. As
shown in Table 1, all four systems show great benefits
when using SSDAlloc with object size information:

• 4.1-17.4 times faster than when using the SSD as a
  swap space.

• 1.2-3.5 times faster than when using the SSD as a
  log-structured swap space.

• Only 9-36 lines of code are modified (malloc replaced
  by SSDAlloc).

• Up to 31.2 times less data written to the SSD for
  the same workload (SSDAlloc works at an object
  granularity).

Application    Original LOC   Edited LOC    Gain vs. SSD swap   Gain vs. log-structured swap
Memcached      [illegible]    [illegible]   [illegible]         [illegible]
B+Tree Index   [illegible]    15            4.3-12.7x           1.4-3.2x
Packet Cache   1,540          9             4.8-10.1x           1.3-2.3x
HashCache      [illegible]    [illegible]   [illegible]         [illegible]

Table 1: SSDAlloc requires changing only the memory allocation
code, typically only tens of lines of code (LOC). Depending
on the SSD used, throughput gains can be as high as 17
times greater than using the SSD as swap. Even if the swap is
optimized for SSD usage, gains can be as high as 3.5x.


The rest of this paper is organized as follows: We de- 
scribe related work and the motivation in Section 2. The 
design is described in Section 3, and we discuss our im- 
plementation in Section 4. Section 5 provides the evalu- 
ation results, and we conclude in Section 6. 


2 Motivation and Related Work 


While alternative memory technologies have been cham- 
pioned for more than a decade [10, 25], their attractive- 
ness has increased recently as the gap between the pro- 
cessor speed and the disk widened, and as their costs 
dropped. Our goal in this paper is to provide a trans- 
parent interface to using flash memory (unlike the ap- 
plication redesign strategy) while acting in a flash-aware 
manner to obtain better performance and lifetime from 
the flash device (unlike the operating system swap). 
Existing transparent approaches to using flash memory
[18, 20, 23] cannot fully exploit flash’s performance
for two reasons. 1) Accesses to flash happen at a page
granularity (4KB), leading to a full page read/write to
flash for every access within that page. Flash’s write/erase
behavior is poorly matched to this usage, leading to
poor performance. Full pages containing dirty objects
have to be written to flash. This behavior leads to write
escalation, which is bad not only for performance but
also for the durability of the flash device. 2) If the
application objects are small compared to the page size,
only a small fraction of RAM contains useful objects
because of caching at a page granularity.
Integrating flash as a filesystem cache can increase
performance, but the cost/benefit tradeoff of this approach
has been questioned before [21].


FlashVM [23] is a system that proposes using flash
as a dedicated swap device and provides hints to the
SSD for better garbage collection by batching writes,
erases and discards. We propose using 16-32 times more
flash than DRAM, and in those settings FlashVM style
heuristic batching/aggregating of in-place writes might
be of little use purely because of the high write
randomness that our targeted applications have. A fully
log-structured system would be needed for minimizing
erases in such cases. We have built a fully log-structured
swap that we use as a comparison point, along with
native Linux swap, against the SSDAlloc system that works
at an object granularity.


Others have proposed redesigning applications to use 
flash-aware data structures to explicitly handle the asym- 
metric read/write behavior of flash. Redesigned applica- 
tions range from databases (BTrees) [19, 24] and Web 
servers [17] to indexes [6, 8] and key-value stores [7]. 
Working set objects are cached in RAM more efficiently 
and the application aggregates objects when writing to 
flash. While the benefits of this approach can be signifi- 
cant, the costs involved and the extra development effort 
(requires expertise with the application and flash behav- 
ior) are high enough that it may deter most application 
developers from going this route. 


Our goal in this paper is to provide the right set of 
interfaces (via memory allocators), so that both existing 
applications and new applications can be easily adapted 
to use flash. Our approach focuses on exposing flash only 
via a page based virtual memory interface while internally
working at an object level. A similar approach was
used in distributed object systems [12], which switched
between pages and objects when convenient using cus- 
tom object handlers. We want to avoid using any custom 
pointer/handler mechanisms to eliminate intrusive appli- 
cation changes. 


Additionally, our approach can improve the cost/benefit
ratio of flash-based approaches. If only a few lines of
memory allocation code need to be modified to migrate
an existing application to a flash-enabled one with
performance comparable to that of flash-aware application
redesign, this one-time development cost is low compared
to the cost of high-density memory. For example,
the cost of 1TB of high-density RAM adds roughly
$100K USD to the $14K base price of the system (e.g.,
the Dell PowerEdge R910). In comparison, a high-end
320GB SSD sells for $3200 USD, so roughly 4 servers
with 5TB of flash memory cost the same as 1 server with
1TB of RAM.
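
The comparison above can be checked with a few lines of arithmetic. The sketch below uses only the prices quoted in the text; it assumes that the 5TB of flash is the total across the four servers (16 SSDs of 320GB, four per server), which is one reading of the sentence rather than something the text states, and all function names are illustrative.

```python
# Back-of-the-envelope check of the RAM-versus-flash cost comparison.
# Prices come from the paragraph above; spreading the SSDs evenly over
# four servers is an assumption made for this illustration.

BASE_SERVER_USD = 14_000      # base price of one server
RAM_1TB_EXTRA_USD = 100_000   # added cost of 1TB of high-density RAM
SSD_USD = 3_200               # one high-end 320GB SSD
SSD_GB = 320


def ram_server_cost() -> int:
    """Cost of one server equipped with 1TB of RAM."""
    return BASE_SERVER_USD + RAM_1TB_EXTRA_USD


def flash_cluster_cost(servers: int, ssds_per_server: int) -> int:
    """Cost of `servers` base servers, each holding `ssds_per_server` SSDs."""
    return servers * (BASE_SERVER_USD + ssds_per_server * SSD_USD)


def flash_cluster_capacity_gb(servers: int, ssds_per_server: int) -> int:
    """Total flash capacity of the cluster in GB."""
    return servers * ssds_per_server * SSD_GB
```

Under these assumptions, four servers with four SSDs each hold 5,120GB of flash and cost $107,200, against $114,000 for the single 1TB RAM server, which is consistent with the "roughly the same" claim.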


USENIX Association 




[Table 2 body garbled in extraction. Rows: SSD Swap; SSD Swap (Write Logged); SSD mmap; Application Rewrite; SSDAlloc. Legible column heading fragments: Technique, Logging, < a page, Dead pages/data, Pollution, Data, Performance, Ease.]


Table 2: While using SSDs via swap/mmap is simple, they achieve only a fraction of the SSD’s performance. Rewriting
applications can achieve greater performance but at a high developer cost. SSDAlloc provides simplicity while providing high performance.







Table 3: SSDAlloc can take full advantage of object-sized ac- 
cesses to the SSD, which can often provide significant perfor- 
mance gains over page-sized operations. 


3 SSDAlloc’s Design 


In this section we describe the design of SSDAlloc. We
start with the requirements that networked systems place
on a hybrid DRAM/SSD setting for high performance
and ease of programming. Our high level
goals for integrating SSDs into these applications are:


• To present a simple interface such that the applications
can be run mostly unmodified: Applications
should use the same programming style and
interfaces as before (via virtual memory managers),
which means that objects, once allocated, always
appear to the application at the same locations in
the virtual memory.

• To utilize the DRAM in the system as efficiently as
possible: Since most of the applications that we
focus on allocate a large number of objects and
operate over them with little locality of reference, the
system should be no worse at using DRAM than a
custom DRAM based object cache that efficiently
packs as many hot objects in DRAM as possible.

• To maximize the SSD’s utility: Since the SSD’s
read performance and especially the write performance
suffer with the amount of data transferred,
the system should minimize data transfers and
(most importantly) avoid random writes.


SSDAlloc employs several design decisions and policies
to meet these high level goals. In Sections 3.1 and 3.4,
we describe our page-based virtual memory system, which
uses a modified heap manager in combination with a
user-space on-demand page materialization runtime and
appears to the application to be a normal virtual memory
system. In reality, the virtual memory
pages are materialized in an on-demand fashion from the
SSD by intercepting page faults. To make this interception
as precise as possible, our allocator aligns the
application level objects to always start at page boundaries.
Such fine grained interception allows our system to act
at an application object granularity and thereby increases
the efficiency of reads, writes and garbage collection on
the SSD. It also helps in the design of a system that can
easily serialize the application’s objects to persistent
storage for subsequent use.

In Section 3.2, we describe how we use the DRAM
efficiently. Since most of the application’s objects are
smaller than a page, it makes no sense to use all of the
DRAM as a page cache. Instead, most of DRAM is filled
with an object cache, which packs multiple useful objects
per page and is not directly accessible to the
application. When the application needs a page, it is
dynamically materialized, either from the object cache or
from the SSD.

In Sections 3.3 and 3.5 we describe how we manage 
the SSD as an efficient log-structured object store. In 
order to reduce the amount of data read/written to the 
SSD, the system uses the object size information, given 
to the memory allocator by the application, to transfer 
only the objects, and not whole pages containing them. 
Since the objects can be of arbitrary sizes, packing them 
together and writing them in a log not only reduces the 
write volume, but also increases the SSD’s lifetime.

Table 2 presents an overview of various techniques 
by which SSDs are used as program memory today and 
provides a comparison to SSDAlloc by enumerating the 
high-level goals that each technique satisfies. We now 
describe our design in detail starting with our virtual ad- 
dress allocation policies. 


3.1 SSDAlloc’s Virtual Memory Structure 


SSDAlloc ideally wants to non-intrusively observe what
objects the application reads and writes. The virtual
memory (VM) system provides an easy way to detect
what pages have been read or written, but there is no easy
way to detect at a finer granularity. Performing copy-on-
write and comparing the copy with the original can be
used for detecting changes, but no easy mechanism
determines what parts of a page were read. Instead,
SSDAlloc uses the observation that virtual address space is
relatively inexpensive compared to actual DRAM, and
reorganizes the behavior of memory allocation to use the
VM system to observe object behavior. Servers typically
expose 48 bit address spaces (256TB) while supporting
less than 1TB of physical RAM, so virtual addresses are
at least 256x more plentiful.


[Figure 1 diagram not recoverable from extraction; one surviving label describes a module “that translates virtual memory addresses to the locations on the SSD”.]

Figure 1: SSDAlloc uses most of RAM as an object-level
cache, and materializes/dematerializes pages as needed to
satisfy the application’s page usage. This approach improves
RAM utilization, even though many objects will be spread
across a greater range of virtual address space.


We propose the Object Per Page (OPP) model, using 
which, if an application requests memory for an object, 
the object is placed on its own page of virtual memory, 
yielding a single page for small objects, or more (con- 
tiguous) when the object exceeds the page size. The 
object is always placed at the start of the page and the 
rest of the page is not utilized for memory allocation. In 
reality, however, we employ various optimizations (described
in Section 3.2) to eliminate the physical memory
wastage that can occur because of such a lavish virtual 
memory usage. An OPP memory manager can be imple- 
mented just by maintaining a pool of pages (details of the 
actual memory manager used are given in Section 3.4). 
OPP is suitable for individual object allocations, typical 
of the applications we focus on. OPP objects are stored 
on the SSD in a log-structured manner (details are ex- 
plained in Section 3.5). Additionally, using virtual mem- 
ory based page-usage information, we can accurately de- 
termine which objects are being read and written (since 
there is only one object per page). However, it is not
straightforward to use arrays of objects in this manner.
In an OPP array, each object is separated by the page’s
size as opposed to the object’s size. While it is possible
to allocate OPP arrays in such a manner, it would
require some code modifications to be able to use arrays in
which objects are separated by page boundaries as opposed
to being separated by object boundaries. We describe later
in Section 3.4 how an OPP based coalescing allocator
can be used to allocate OPP based arrays.
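
The difference between OPP placement and ordinary contiguous placement can be made concrete with a small address calculation. This is only a sketch under the 4KB page size used in the text; the function names are illustrative and are not part of SSDAlloc.

```python
PAGE_SIZE = 4096  # 4KB pages, as assumed throughout the text


def pages_for_object(size: int) -> int:
    """Number of contiguous virtual pages an OPP object of `size` bytes occupies."""
    if size <= 0:
        raise ValueError("object size must be positive")
    return (size + PAGE_SIZE - 1) // PAGE_SIZE


def opp_array_element_address(base: int, index: int, element_size: int) -> int:
    """Address of element `index` in an OPP array: the stride is whole pages."""
    return base + index * pages_for_object(element_size) * PAGE_SIZE


def c_array_element_address(base: int, index: int, element_size: int) -> int:
    """Address of the same element in an ordinary contiguous C array."""
    return base + index * element_size
```

For 64-byte elements, consecutive OPP elements are 4096 bytes apart rather than 64, which is why code that indexes with ordinary pointer arithmetic cannot use an OPP array unchanged.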


3.1.1 Contiguous Array Allocations 


In the C programming language, array allocations via 
malloc/calloc expect array elements to be contigu- 
ous. We present an option called Memory Pages (MP) 
which can do this. In MP, when the application asks for a 
certain amount of memory, SSDAlloc returns a pointer to 
a region of virtual address space with the size requested. 
We use a ptmalloc [5] style coalescing memory man- 
ager (further explained in Section 3.4) built on top of 
bulk allocated virtual memory pages (via brk) to obtain 
a system which can allocate C style arrays. Internally, 
however, the pages in this space are treated like page 
sized OPP objects. For the rest of the paper, we treat 
MP pages as page sized OPP objects. 

While the design of OPP efficiently leverages the vir- 
tual memory system’s page level usage information to 
determine application object behavior, it could lead to 
DRAM space wastage because the rest of the page be- 
yond the object would not be used. To eliminate this 
wastage, we organize the physical memory such that only 
a small portion of DRAM contains actual materializa- 
tions of OPP pages (Page Buffer) while the rest of the 
available DRAM is used as a compact hot object cache. 


3.2 SSDAlloc’s Physical Memory Structure


The SSDAlloc runtime system eases application trans- 
parency by allowing objects to maintain the same virtual 
address over their lifetimes, while their physical loca- 
tion may be in a temporarily-materialized physical page 
mapped to its virtual memory page in the Page Buffer, 
the RAM Object Cache, or the SSD. Not only does the 
runtime materialize physical pages as needed, but it also 
reclaims them when their usage drops. We first describe
how objects are cached compactly in DRAM. 

RAM Object Cache: Objects are cached in the RAM
object cache in a compact manner. The RAM object cache
occupies the available portion of DRAM, while only a small
part of DRAM is used for pages that are currently in
use (shown in Figure 1). This decision provides several
benefits: 1) Objects cached in RAM can be accessed
much faster than the SSD, 2) By performing usage-based
caching of objects instead of pages, the relatively small
RAM can cache more useful objects when using OPP,
and 3) Given the density trends of SSD and RAM, object
caching is likely to continue being a useful optimization
going forward.

RAM object cache is maintained in LRU fashion. It in- 
dexes objects using their virtual memory page address as 
the key. An OPP object in RAM object cache is indexed 
by its OPP page address, while an MP page (a 4KB OPP 
object) is indexed with its MP page address. In our im- 
plementation, we used a hashtable with the page address 
as the key for this purpose. Clean objects being evicted 
from the RAM object cache are deallocated while dirty 
objects being evicted are enqueued to the SSD writer 
mechanism (shown in Figure 1). 
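
The eviction policy just described can be sketched as follows. This is a simplified model, not SSDAlloc's implementation: capacity is counted in objects rather than bytes, Python's `OrderedDict` stands in for the hashtable plus LRU ordering, and the class and attribute names are invented for the illustration.

```python
from collections import OrderedDict


class RamObjectCache:
    """LRU cache of objects keyed by virtual page address (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity        # maximum number of cached objects
        self.entries = OrderedDict()    # page_addr -> (data, dirty)
        self.ssd_write_queue = []       # dirty evictions awaiting the SSD writer

    def get(self, page_addr: int):
        """Return the cached bytes, or None on a miss; a hit refreshes recency."""
        if page_addr not in self.entries:
            return None
        self.entries.move_to_end(page_addr)
        return self.entries[page_addr][0]

    def put(self, page_addr: int, data: bytes, dirty: bool) -> None:
        """Insert or update an object, evicting the least recently used if full."""
        if page_addr in self.entries:
            # Once dirty, an object stays dirty until it is written out.
            dirty = dirty or self.entries[page_addr][1]
        self.entries[page_addr] = (data, dirty)
        self.entries.move_to_end(page_addr)
        while len(self.entries) > self.capacity:
            old_addr, (old_data, old_dirty) = self.entries.popitem(last=False)
            if old_dirty:
                # Dirty objects are handed to the SSD writer.
                self.ssd_write_queue.append((old_addr, old_data))
            # Clean objects are simply dropped.
```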

Page Buffer: Temporarily materialized pages (in
physical memory) are collectively known as the Page
Buffer. These pages are materialized in an on-demand
fashion (described below). The Page Buffer size is
application configurable, but in most of the applications we
tested, we found that a Page Buffer of less than
25MB was sufficient to bring the rate of page
materializations per second down to the throughput of the
application. However, regardless of the size of the Page Buffer,
physical memory wastage from using OPP has to be
minimized. To minimize this wastage, we make the rest of the
active OPP physical page (the portion beyond the object) a
part of the RAM object cache. The RAM object cache is
implemented such that the shards of pages that materialize
into physical memory are used for caching objects.

SSDAlloc’s Paging: For a simple user space
implementation, we implement the Page Buffer via memory
protection. All virtual memory allocated using SSDAlloc
is protected (via mprotect). A page usage is
detected when the protection mechanism triggers a fault.
The required page is then unprotected (only read or write
access is given, depending on the type of fault, to be able
to detect writes separately) and its data is then populated
in the seg-fault handler: an OPP page is populated by
fetching the object from the RAM object cache or the SSD
and placing it at the front of the page. An MP page is
populated with a copy of the page (a page sized object)
from the RAM object cache or the SSD.

Pages dematerialized from the Page Buffer are converted
to objects. Those objects are pushed into the RAM object
cache, the page is then madvised as not needed, and
finally the page is reprotected (via mprotect). In both the
OPP and MP cases, the object/page is marked as dirty if
the page faulted on a write.

Page Buffer can be managed in many ways, with the 
simplest way being FIFO. Page Buffer pages are unpro- 
tected, so our user space implementation based runtime 
would have no information about how a page would be 
used while it remains in the Page Buffer, making LRU 
difficult to implement. For simplicity, we used FIFO in 
our current implementation. The only penalty is that if a 
dematerialized page is needed again then the page has to 
be rematerialized from RAM. 
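
The fault-driven materialization and FIFO dematerialization described above can be modeled without real memory protection. The following is a pure simulation under several simplifying assumptions: a plain dictionary stands in for the RAM object cache, objects leave that dictionary while their page is materialized, SSD fetches are left out, and every name is hypothetical. A real runtime would perform these steps inside a seg-fault handler using `mprotect` and `madvise`.

```python
from collections import deque


class PageBuffer:
    """FIFO buffer of materialized pages; a simulation, not real mprotect code."""

    def __init__(self, max_pages: int, object_cache: dict):
        self.max_pages = max_pages
        self.object_cache = object_cache  # page_addr -> (data, dirty)
        self.fifo = deque()               # page addresses in materialization order
        self.pages = {}                   # page_addr -> page state

    def fault(self, page_addr: int, is_write: bool) -> None:
        """Handle a protection fault on `page_addr`."""
        page = self.pages.get(page_addr)
        if page is not None:
            # The page was materialized read-only and is now being written:
            # grant write access and remember that it is dirty.
            if is_write:
                page["writable"] = True
                page["dirty"] = True
            return
        if len(self.fifo) >= self.max_pages:
            self._dematerialize(self.fifo.popleft())
        # Simplification: the object is assumed to be in the object cache.
        data, dirty = self.object_cache.pop(page_addr)
        self.pages[page_addr] = {
            "data": data,
            "writable": is_write,
            "dirty": dirty or is_write,
        }
        self.fifo.append(page_addr)

    def _dematerialize(self, page_addr: int) -> None:
        """Turn the oldest page back into an object in the RAM object cache."""
        page = self.pages.pop(page_addr)
        self.object_cache[page_addr] = (page["data"], page["dirty"])
        # A real runtime would now madvise() the page away and re-protect it.
```

Granting read-only access first is what lets the runtime see a second fault on the first write, so that only pages that were actually written are marked dirty.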

OPP can have more virtual memory usage than
malloc for the same amount of data allocated. While
MP will round each virtual address allocation up to the next
multiple of the page size, the OPP model allocates one object
per page. For 48-bit address spaces, the total number of
pages is 2^36 (= 64 billion objects via OPP). For 32-bit
systems, the corresponding number is 2^20 (= 1 million
objects). Programs that need to allocate more objects on
32-bit systems can use MP instead of OPP. Furthermore,
SSDAlloc can coexist with standard malloc, so address
space usage can be tuned by moving only necessary
allocations to OPP.
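
The page counts quoted above follow directly from dividing the address space by the page size. A minimal check, assuming 4KB pages and one single-page object per page (function names are illustrative):

```python
PAGE_SIZE = 4096  # bytes per page


def address_space_bytes(bits: int) -> int:
    """Size in bytes of a virtual address space with `bits` address bits."""
    return 1 << bits


def max_opp_objects(bits: int) -> int:
    """Upper bound on single-page OPP objects: one object per virtual page."""
    return address_space_bytes(bits) // PAGE_SIZE
```

Here `max_opp_objects(48)` equals 2^36 and `max_opp_objects(32)` equals 2^20, matching the figures in the text.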

While the separation between virtual memory and 
physical memory presents many avenues for DRAM op- 
timization, it does not directly optimize SSD usage. We 
next present our SSD organization. 


3.3 SSDAlloc’s SSD Maintenance


To overcome the limitations on random write behavior
with SSDs, SSDAlloc writes the dirty objects to the SSD
in a log-structured [22] manner when flushing the RAM
object cache. This means that the objects have
no fixed storage location on the SSD, similar to
flash-based filesystems [11]. We first describe how we
manage the mapping from fixed virtual address spaces to
ever-changing log-structured SSD locations. Our SSD
writer/garbage-collector is described later.

To locate objects on the SSD, SSDAlloc uses a data 
structure called the Object Table. While the virtual 
memory addresses of the objects are their fixed locations, 
Object Tables store their ever-changing SSD locations. 
Object Tables are similar to page tables in traditional vir- 
tual memory systems. Each Object Table has a unique 
identifier called the OTID and it contains an array of in- 
tegers representing the SSD locations of the objects it 
indexes. An object’s Object Table Offset (OTO) is the 
offset in this array where its SSD location is stored. The 
2-tuple <OTID, OTO> is the object’s internal persistent 
pointer. 
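
As a rough illustration of this indirection, an Object Table can be modeled as an array of SSD locations addressed by a stable (OTID, OTO) pair. The class names and the `UNWRITTEN` marker below are hypothetical and chosen only for this sketch.

```python
from typing import Dict, List, NamedTuple

UNWRITTEN = -1  # hypothetical marker: the object has no copy on the SSD yet


class PersistentPointer(NamedTuple):
    otid: int  # identifier of the Object Table
    oto: int   # offset of the object's entry within that table


class ObjectTable:
    """Array of SSD locations for the objects of one virtual memory range."""

    def __init__(self, otid: int, num_objects: int):
        self.otid = otid
        self.locations: List[int] = [UNWRITTEN] * num_objects

    def pointer(self, oto: int) -> PersistentPointer:
        """Stable internal pointer for the object stored at offset `oto`."""
        return PersistentPointer(self.otid, oto)


def ssd_location(tables: Dict[int, ObjectTable], ptr: PersistentPointer) -> int:
    """Current SSD location of the object that `ptr` names."""
    return tables[ptr.otid].locations[ptr.oto]


def relocate(tables: Dict[int, ObjectTable], ptr: PersistentPointer,
             new_location: int) -> None:
    """Record that the log writer placed a new copy of the object elsewhere."""
    tables[ptr.otid].locations[ptr.oto] = new_location
```

The pointer never changes when the object moves in the log; only the table entry does.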

To efficiently fetch the objects from the SSD when
they are not cached in RAM, we keep a mapping between
each virtual address range (as allocated by the OPP or the
MP memory manager) in use by the application and its
corresponding Object Table, called an Address Translation
Module (ATM). When the object of a page that is
requested for materialization is not present in the RAM
object cache, the <OTID,OTO> of that object is determined
from the page’s address via an ATM lookup (shown in
Figure 1). Once the <OTID,OTO> is known, the object
is fetched from the SSD, inserted into the RAM object
cache and the page is then materialized. The ATM is
only used when the RAM object cache does not have the
required objects. A successful lookup results in a
materialized physical page that can be used without runtime
system intervention for as long as the page resides in the
Page Buffer. If the page that is requested does not
belong to any allocated range, then the segmentation fault
is a program error. In that case control is returned to
the originally installed seg-fault handler.


The ATM indexes and stores the 2-tuples <Virtual
Memory Range, OTID> such that when it is queried
with a virtual memory page address, it responds with the
<OTID,OTO> of the object belonging to the page. In
our implementation, we chose a balanced binary search
tree for three reasons: 1) the virtual memory range can
be used as a key while the OTID can be used as a value.
The search tree can be queried using an arbitrary page
address and, by using a binary search, one can determine
the virtual memory range it belongs to. Using the queried
page’s offset into this range, the relevant object’s OTO is
determined; 2) it allows the virtual memory ranges to be
of any size; and 3) it provides a simple mechanism by
which we can improve the lookup performance: reducing
the number of Object Tables reduces
the number of entries in the binary search tree. Our heap
manager, which allocates virtual memory (in OPP or MP
style), always tries to keep the number of virtual memory
ranges in use to a minimum to reduce the number of
Object Tables in use. Before we describe our heap manager
design, we present a few simple optimizations to reduce
the size of Object Tables.
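
A range lookup of this kind can be sketched with a sorted list and binary search. Note the substitutions: the text uses a balanced binary search tree, whereas this sketch uses Python's `bisect` module over a sorted list, and it assumes non-overlapping ranges of single-page objects so that the OTO is simply the page index within the range. All names are illustrative.

```python
import bisect
from typing import List, Optional, Tuple

PAGE_SIZE = 4096


class AddressTranslationModule:
    """Maps a virtual page address to (OTID, OTO); a sorted-list stand-in."""

    def __init__(self):
        self._starts: List[int] = []                   # sorted range start addresses
        self._ranges: List[Tuple[int, int, int]] = []  # (start, end_exclusive, otid)

    def add_range(self, start: int, num_pages: int, otid: int) -> None:
        """Register a virtual range (assumed not to overlap existing ranges)."""
        idx = bisect.bisect_left(self._starts, start)
        self._starts.insert(idx, start)
        self._ranges.insert(idx, (start, start + num_pages * PAGE_SIZE, otid))

    def lookup(self, page_addr: int) -> Optional[Tuple[int, int]]:
        """Return (OTID, OTO) for `page_addr`, or None if it is not allocated."""
        idx = bisect.bisect_right(self._starts, page_addr) - 1
        if idx < 0:
            return None
        start, end, otid = self._ranges[idx]
        if page_addr >= end:
            return None
        return otid, (page_addr - start) // PAGE_SIZE
```

A `None` result corresponds to the case in which the fault is a genuine program error and is passed to the originally installed handler.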


We try to store the Object Tables fully in DRAM to 
minimize multiple SSD accesses to read an object. We 
perform two important optimizations to reduce the size 
overhead from the Object Tables. First — to be able to 
index large SSDs for arbitrarily sized objects, one would 
need a 64 bit offset that would increase the DRAM over- 
head for storing Object Tables. Instead, we store a 32 
bit offset to an aligned 512 byte SSD sector that contains 
the start of the object. While objects may cross the 512 
byte sector boundaries, the first two bytes in each sector 
are used to store the offset to the start of the first object 
starting in that sector. Each object’s on-SSD metadata 
contains its size, using which, we can then find the rest of 
the object boundaries in that sector. We can index 2TB of 
SSD this way. 40 bit offsets can be used for larger SSDs.
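
The capacity figures follow from multiplying the number of addressable sectors by the sector size. A minimal check, with illustrative function names:

```python
SECTOR_SIZE = 512  # bytes per SSD sector


def indexable_bytes(offset_bits: int) -> int:
    """Bytes of SSD addressable with a sector index of `offset_bits` bits."""
    return (1 << offset_bits) * SECTOR_SIZE


def sector_byte_address(sector_index: int) -> int:
    """Byte address on the SSD of the sector an Object Table entry points to."""
    return sector_index * SECTOR_SIZE
```

With 32-bit sector indexes this gives 2^32 x 512 bytes = 2TB, as stated; 40-bit indexes would extend the limit to 512TB.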

Our second optimization addresses Object Table overhead
from small objects. For example, four byte objects
can create 100% DRAM overhead from their Object Table
offsets. To reduce this overhead, we introduce object
batching: small objects are batched into larger contiguous
objects. We batch enough objects together such that
the size of the larger object is at least 128 bytes
(restricting the Object Table overhead to a small fraction, 1/32).
Pages, however, are materialized in regular OPP style,
one small object per page, while batched objects are
internally maintained as a single object.
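
The overhead bound can be verified with exact fractions. The sketch assumes a 4-byte (32-bit) Object Table entry per indexed object, as described above; the helper names are illustrative.

```python
from fractions import Fraction

ENTRY_BYTES = 4        # one 32-bit Object Table entry
MIN_BATCH_BYTES = 128  # smallest size of a batched object


def table_overhead(object_bytes: int) -> Fraction:
    """Object Table bytes per byte of indexed object data."""
    return Fraction(ENTRY_BYTES, object_bytes)


def objects_per_batch(object_bytes: int) -> int:
    """Fewest objects whose combined size reaches MIN_BATCH_BYTES."""
    return max(1, -(-MIN_BATCH_BYTES // object_bytes))  # ceiling division
```

Unbatched four-byte objects give an overhead of 1 (100%), while a batch of 32 such objects is indexed by one entry for 128 bytes, an overhead of 1/32.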


3.4 SSDAlloc’s Heap Manager 


Internally, SSDAlloc’s virtual memory allocation mech- 
anism works like a memory manager over large Object 
Table allocations (shown in Figure 1). This ensures that 
a new Object Table is not created for every memory 
allocation. The Object Tables and their corresponding 
virtual memory ranges are created in bulk and memory 
managers allocate from these regions to increase ATM 
lookup efficiency. We provide two kinds of memory 
managers — An object pool allocator which is used for 
individual allocations and a ptmalloc style coalescing 
memory manager. We keep the pool allocator separate 
from the coalescing allocator for the following reasons: 
1) Many of our focus applications prefer pool allocators, 
so providing a pool allocator further eases their devel- 
opment, 2) Pool allocators reduce the number of page 
reads/writes by not requiring coalescing, and 3) Pool al- 
locators can export simpler memory usage information, 
increasing garbage collector efficiency. 

Object Pool Allocator: SSDAlloc provides an object 
pool allocator for allocating objects individually via OPP. 
Unlike traditional pool allocators, we do not create pools 
for each object type, but instead create pools of differ- 
ent size ranges. For example, all objects of size less than 
0.5KB are allocated from one pool, while objects with 
sizes between 0.5KB and 1KB are allocated from another
pool. Such pools exist for every 0.5KB size range, since 
OPP performs virtual memory operations at page gran- 
ularity. Despite the pools using size ranges, we avoid 
wasting space by obtaining the actual object size from 
the application at allocation time, and using this size both 
when the object is stored in the RAM object cache, and 
when the object is written to the SSD. When reading an 
object from the SSD, the read is rounded to the pool size 
to avoid multiple small reads. 
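
Pool selection by size range can be sketched as below. The handling of sizes that fall exactly on a 0.5KB boundary, and the reading of "pool size" as the upper bound of the pool's range, are assumptions made for this illustration.

```python
POOL_STEP = 512  # bytes; one pool per 0.5KB size range


def pool_index(size: int) -> int:
    """Index of the pool serving `size`-byte objects (0 for sizes below 0.5KB)."""
    if size <= 0:
        raise ValueError("object size must be positive")
    return size // POOL_STEP


def ssd_read_size(size: int) -> int:
    """Bytes read from the SSD for an object, rounded up to its pool's bound."""
    return (pool_index(size) + 1) * POOL_STEP
```

A 700-byte object, for example, is served by the second pool and fetched with a single 1KB read, even though only 700 bytes are stored for it in the RAM object cache and in the log.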

SSDAlloc maintains each pool as a free list. A pool
starts with a single allocation of 128 objects (one Object
Table, with pages contiguous in virtual address space)
and doubles in size when it runs out of space
(with a single Object Table and a contiguous virtual
memory range). No space in the RAM object cache or
the SSD is actually used when the size of a pool is
increased, since only virtual address space is allocated.
The pool stops doubling in size when it reaches a size
of 10,000 (configurable) and starts linearly increasing in
steps of 10,000 from then on. The free-list state of an
object can be used to determine if an object on the SSD is
garbage, enabling object-granularity garbage collection.
This separation of the heap-manager state from
where the data is actually stored is similar to the
“frame-heap” implementation of Xerox PARC’s Mesa and Cedar
languages [15].
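
One way to read the growth policy is shown below. The text does not say exactly what happens at the transition, so this sketch assumes that the pool keeps doubling while it is smaller than 10,000 objects and grows by 10,000 afterwards; the names and that interpretation are illustrative.

```python
INITIAL_OBJECTS = 128
LINEAR_STEP = 10_000  # configurable according to the text


def next_pool_size(current: int) -> int:
    """Pool size after one growth step under the assumed policy."""
    if current < LINEAR_STEP:
        return current * 2
    return current + LINEAR_STEP


def growth_sequence(steps: int) -> list:
    """Pool sizes starting from the initial allocation, one entry per step."""
    sizes = [INITIAL_OBJECTS]
    for _ in range(steps):
        sizes.append(next_pool_size(sizes[-1]))
    return sizes
```

Because growth only reserves virtual address space, none of these steps consumes DRAM or SSD capacity by itself.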

Like Object Tables, we try to maintain free lists in
DRAM, so the free list size is tied to the number of free
objects, instead of the total number of objects. To
reduce the size of the free list we do the following: the
free list actively indexes the state of only one Object
Table of each pool at any point of time, while the
allocation state for the rest of the Object Tables in each pool
is managed using a compact bitmap notation along with
a count of free objects in each Object Table. When the
heap manager cannot allocate from the current Object Table,
it simply changes the current Object Table’s free list
representation to a bitmap and moves on to the Object Table
with the largest number of free objects, or it increases the
size of the pool.
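
The switch between the two representations can be sketched as follows. A Python integer serves as the bitmap, and the class and method names are invented for this example rather than taken from SSDAlloc.

```python
class TableAllocState:
    """Allocation state of one Object Table, in one of two representations."""

    def __init__(self, num_objects: int):
        self.num_objects = num_objects
        self.free_list = list(range(num_objects))  # active representation
        self.bitmap = None                          # compact representation
        self.free_count = num_objects

    def allocate(self):
        """Pop a free offset; only valid while the free list is active."""
        if self.free_list is None:
            raise RuntimeError("table is in bitmap form; activate it first")
        if not self.free_list:
            return None
        self.free_count -= 1
        return self.free_list.pop()

    def to_bitmap(self) -> None:
        """Compact the free list into a bitmap (bit i set means offset i is free)."""
        bits = 0
        for oto in self.free_list:
            bits |= 1 << oto
        self.bitmap, self.free_list = bits, None

    def to_free_list(self) -> None:
        """Expand the bitmap back into a free list when the table becomes active."""
        self.free_list = [i for i in range(self.num_objects)
                          if (self.bitmap >> i) & 1]
        self.bitmap = None


def pick_next_table(tables):
    """Choose the table with the most free objects, as the text describes."""
    return max(tables, key=lambda t: t.free_count)
```

Keeping a free count beside each bitmap lets the heap manager choose the next table without scanning any bitmap.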

Coalescing Allocator: SSDAlloc’s coalescing mem- 
ory manager works by using memory managers like pt- 
malloc [5] over large address spaces that have been re- 
served. In our implementation we use a simple best- 
first with coalescing memory manager [5] over large 
pre-allocated address spaces, in steps of 10,000 (config- 
urable) pages; no DRAM or SSD space is used for these 
pre-allocations, since only virtual address space is re- 
served. Each object/page allocated as part of the coalesc- 
ing memory manager is given extra metadata space in the 
header of a page to hold the memory manager informa- 
tion (objects are then appropriately offset). OPP arrays of 
any size can be allocated by performing coalescing at the 
page granularity, since OPP arrays are simply arrays of 
pages. MP pages are treated like pages in the traditional 
virtual memory system. The memory manager works ex- 
actly like traditional malloc, coalescing freely at byte 
granularity. Thus, MP with our Coalescing Allocator can 
be used as a drop-in replacement for log-structured swap. 


A dirty object evicted by RAM object cache needs to 
be written to the SSD’s log and the new location has to be 
entered at its OTO. This means that the older location of 
the object has to be garbage collected. An OPP object on 
the SSD which is in a free-list also needs to be garbage- 
collected. Since SSDs do not have the mechanical delays 
associated with a moving disk head, we can use a sim- 
pler garbage collector than the seek-optimized ones de- 
veloped for disk-based log-structured file systems [22]. 
Our cleaner performs a “read-modify-write” operation
over the SSD sequentially: it reads any live objects at
the head of the log, packs them together, and writes them
along with flushed dirty objects from RAM.


3.5 SSDAlloc’s Garbage Collector


The SSDAlloc Garbage Collector (GC) activates whenever
the RAM object cache has evicted enough
dirty objects (as shown in Figure 1) to amortize the
cost of writing to the SSD. We use a simple read-modify-
write garbage collector, which reads enough partially-
filled blocks (of configurable size, preferably large) at
the head of the log to make space for the new writes.
Each object on the SSD has its 2-tuple <OTID,OTO>
and its size as the metadata, used to update the Object
Table. This back pointer is also used to figure out if the
object is garbage, by matching the location in the Object
Table with the actual offset. To minimize the number of
reads per iteration of the GC on the SSD, we maintain in
RAM the amount of free space per 128KB block. These
numbers can be updated whenever an object in an erase
block is moved elsewhere (live object migration for
compaction), when a new object is written to it (for writing
out dirty objects) or when the object is moved to a free
list (object is “free”).
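
The liveness test based on the back pointer can be sketched as below. Object Tables are modeled as plain lists of locations and the free-list state as a set of (OTID, OTO) pairs; both are simplifications, and the names are hypothetical. Treating an unknown OTID as garbage follows the rule stated in Section 3.6.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class OnSsdObject(NamedTuple):
    location: int  # where this copy sits in the log
    otid: int      # back pointer: Object Table identifier
    oto: int       # back pointer: offset within that table
    size: int


def is_live(obj: OnSsdObject,
            object_tables: Dict[int, List[int]],
            freed: Set[Tuple[int, int]]) -> bool:
    """A copy is live only if its table entry still points at this location."""
    table = object_tables.get(obj.otid)
    if table is None:
        return False  # unknown OTID: garbage
    if (obj.otid, obj.oto) in freed:
        return False  # the application freed the object
    return table[obj.oto] == obj.location


def clean_block(block: List[OnSsdObject],
                object_tables: Dict[int, List[int]],
                freed: Set[Tuple[int, int]]) -> List[OnSsdObject]:
    """Live objects of a block, which the cleaner would rewrite at the log tail."""
    return [o for o in block if is_live(o, object_tables, freed)]
```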

While the design so far focused on obtaining high 
performance from DRAM and flash in a hybrid setting, 
memory allocated via SSDAlloc is not non-volatile. We 
now present our durability framework to preserve appli- 
cation memory and state on the SSD. 


3.6 SSDAlloc’s Durability Framework 


SSDAlloc helps applications make their data persistent
across reboots. Since SSDAlloc is designed to use much
more SSD-backed memory than the RAM in the system,
the runtime is expected to keep the data persistent
across reboots to avoid the loss of work.

SSDAlloc’s checkpointing is a way to cleanly shut
down an SSDAlloc based application while making
objects and metadata persistent so that they can be used
across reboots.
Objects can be made persistent by simply flushing all the 
dirty objects from RAM object cache to the SSD. The 
state of the heap-manager, however, needs more support 
to be made persistent. The bitmap style free list represen- 
tation of the OPP pool allocator makes the heap-manager 
representation of individually allocated OPP objects easy 
to be serialized to the SSD. However, the heap-manager 
information as stored by a coalescing memory manager 
used by the OPP based array allocator and the MP based 
memory allocator would need a full scan of the data on 
the SSD to be regenerated after a reboot. Our current 
implementation provides durability only for the individ- 
ually allocated OPP objects and we wish to provide dura- 
bility for other types of SSDAlloc data in the future. 

We provide durability for the heap-manager state of
the individually allocated OPP objects by reserving a
known portion of the SSD for storing the corresponding
Object Tables and the free list state (a bitmap). Since
the maximum Object Table space to object size overhead
ratio is 1/32, we reserve slightly more than 1/32 of the
total SSD space (by using a file that occupies that much
space) where the Object Tables and the free list state can
be serialized for later use.

It should be possible to garbage collect dead objects 
across reboots. This is handled by making sure that our 
copy-and-compact garbage collector is always aware of 
all the OTIDs that are currently active within the SS- 
DAlloc system. Any object with an unknown OTID 
is garbage collected. Additionally, any object with an 
OTID that is active is garbage collected only according 
to the criteria discussed in Section 3.5. 

Virtual memory address ranges of each Object Ta- 
ble must be maintained across reboots, because check- 
pointed data might contain pointers to other check- 
pointed data. We store the virtual memory address range 
of each Object Table in the first object that this Object 
Table indexes. This object is written once at the time of 
creation of the Object Table and is not made available to 
the heap manager for allocation. 


3.7 SSDAlloc’s Overhead 


We observe that the overhead introduced by SSDAlloc’s
runtime mechanism is minor compared to the performance
limits of today’s high-end SSDs. We reach that conclusion
by benchmarking the runtime mechanism on a test machine
with a 2.4 GHz quad-core processor. To benchmark
the latency overhead of the signal handling mechanism, we
protect 200 million pages and then measure the maximum
seg-fault generation rate that can be attained. To measure the
ATM lookup latency, we build an ATM with a million
entries and then measure the maximum lookup throughput
that can be obtained. To benchmark the latency of
an on-demand page materialization of an object from the 
RAM object cache to a page within the Page Buffer, we 
populate a page with random data and measure the la- 
tency. To benchmark the page dematerialization of a 
page from the Page Buffer to an object in the RAM ob- 
ject cache, we copy the contents of the page elsewhere, 
madvise the page as not needed and reprotect the page 
using mprotect and measure the total latency. To 
benchmark the latency of TLB misses (through L3) we 
use a CPU benchmarking tool, the Calibrator [2], by al- 
locating 15GB of memory per core. Table 4 presents the 
results. The latencies of all the overheads clearly indicate
that they would not be a bottleneck even for high-end
SSDs like the FusionIO ioXtreme drives, which can provide
up to 250,000 IOPS. In fact, one would need 5 such
SSDs for the SSDAlloc runtime to saturate the CPU.
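
The claim about five SSDs can be reproduced with a short back-of-the-envelope calculation. The sketch below takes the combined overhead from Table 4 and assumes its unit is microseconds per operation (the unit is an assumption here, chosen because it matches the "over 1 million operations per second" bound in the table caption).

```python
import math

COMBINED_OVERHEAD_USEC = 0.833   # "Combined Overhead" row of Table 4 (unit assumed: microseconds)
SSD_IOPS = 250_000               # per-drive figure quoted in the text

# Upper bound on operations per second if the runtime overhead were the only cost.
max_ops_per_sec = 1e6 / COMBINED_OVERHEAD_USEC

# Number of 250,000-IOPS drives needed before the runtime becomes the bottleneck.
ssds_to_saturate = max_ops_per_sec / SSD_IOPS
whole_ssds = math.ceil(ssds_to_saturate)

print(round(max_ops_per_sec), round(ssds_to_saturate, 2), whole_ssds)
```

Under that assumption the bound is roughly 1.2 million operations per second, or a little under five drives' worth of IOPS.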


Page Dematerialization    0.172
Signal Handling           0.666
Combined Overhead         0.833

Table 4: SSDAlloc’s overheads are quite low, and place an upper
limit of over 1 million operations per second using low-end
server hardware. This request rate is much higher than even the
higher-performance SSDs available today, and is higher than
even what most server applications need from RAM.

The largest CPU overhead is from the signal handling
mechanism, which is present only because of a
user-space implementation. With an in-kernel implementation,
the VM pager can be used to manage the Page
Buffer, which would further reduce the CPU usage. We
designed OPP for applications with high read random- 
ness without much locality, because of which, using OPP 
will not greatly increase the number of TLB (through L3) 
misses. Hence, applications that are not bottlenecked by 
DRAM (but by CPU, network, storage capacity, power 
consumption or magnetic disk) can replace DRAM with 
high-end SSDs via SSDAlloc and reduce hardware ex- 
penditure and power costs. For example, Facebook’s 
memcache servers are bottlenecked by network parame- 
ters [3]; their peak performance of 200,000 tps per server 
can be easily obtained by using today’s high-end SSDs as 
RAM extension via SSDAlloc. 

DRAM overhead created from the Object Tables is 
compensated by the performance gains. For example, a 
300GB SSD would need 10GB and 300MB of space for 
Object Tables when using OPP and MP respectively for 
creating 128 byte objects. However, SSDAlloc’s random 
read/write performance when using OPP is 3.5 times bet- 
ter than when using MP (shown in Section 5). Addition- 
ally, for the same random write workload OPP generates 
32 times less write traffic to the SSD when compared to 
MP and thereby increases the lifetime of the SSD. Moreover,
with an in-kernel implementation, only one of the
page tables and the Object Tables would be needed, as they both
serve the same purpose, further reducing the overhead of
having the Object Tables in DRAM.
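
The Object Table sizes quoted above follow from simple arithmetic if one assumes a fixed-size table entry per object (OPP) or per 4KB page (MP). The sketch below assumes 4-byte entries; that entry size is not stated in the text and is inferred here only because it reproduces the quoted 10GB and 300MB figures.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
ENTRY_BYTES = 4      # assumed size of one Object Table entry (not stated in the text)
PAGE_BYTES = 4096    # MP keeps one entry per 4KB page


def object_table_bytes(ssd_bytes, unit_bytes, entry_bytes=ENTRY_BYTES):
    """DRAM needed to index an SSD of ssd_bytes at a granularity of unit_bytes."""
    return (ssd_bytes // unit_bytes) * entry_bytes


opp_table = object_table_bytes(300 * GIB, 128)         # one entry per 128-byte object
mp_table = object_table_bytes(300 * GIB, PAGE_BYTES)   # one entry per page

print(opp_table / GIB, mp_table / MIB)
```

With these assumptions the OPP table comes to about 9.4 GiB (roughly 10GB in decimal units) and the MP table to 300 MiB, a 32-fold difference that mirrors the ratio between the page size and the object size.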


4 Implementation and the API 


Figure 2: SSDAlloc’s thread-safe memory allocators allow applications
to exploit the full parallelism of many SSDs, which
can yield significant performance advantages. Shown here is
the performance for 4KB reads.

We have implemented our SSDAlloc prototype as a C++
library in roughly 10,000 lines of code. It currently supports
SSD as the only form of flash memory, though it
could later be expanded, if necessary, to support other
forms of flash memory. In our current implementation,
applications can coexist by creating multiple files on
the SSD. Alternatively, an application can use the entire
SSD as a raw disk device for high performance. While
the current implementation uses flash memory via an I/O
controller, such an overhead may be avoided in the future [13].
We present an overview of the implementation
via a description of the API.

ssd_oalloc: void* ssd_oalloc( int numObjects, int objectSize ):
is used for OPP allocations, both individual and
array allocations. If numObjects is 1, the object is allocated
from the in-built OPP pool allocator. If it is more
than 1, it is allocated from the OPP coalescing memory
manager.

ssd_malloc: void* ssd_malloc( size_t size ): allocates 
size bytes of memory using the heap manager (described 
in Section 3.4) on MP pages. Similar calls exist for 
ssd_calloc and ssd_realloc. 

ssd_free: void ssd_free( void* va_address ): deallo- 
cates the objects whose virtual allocation address is 
va_address. If the allocation was via the pool allocator 
then the <OTID,OTO> of the object is added to the ap- 
propriate free list. In case of array allocations, the in- 
built memory manager frees the data according to our 
heap manager. SSDAlloc is designed to work with low-level
programming languages like C. Hence, the onus
of avoiding memory leaks and of freeing the data appropriately
is on the application.

checkpoint: int checkpoint( char* filename ): flushes all 
dirty objects to the SSD and writes all the Object Tables 
and free-lists of the application to the file filename. This 
call is used to make the objects of an application durable.

restore: int restore( char* filename ): restores the SSDAlloc
state for the calling application. It reads the file
(filename) containing the Object Tables and the free list
state needed by the application, mmaps the necessary
address for each Object Table (using the first object entry),
and then inserts the mappings into the ATM as described
in Section 3.6.

SSDs scale performance with parallelism. Figure 2
shows how some high-end SSDs have internal parallelism
(for 0.5KB reads; other read sizes also exhibit parallelism).
Additionally, multiple SSDs could be used within
an application. All SSDAlloc functions, including the
heap manager, are implemented in a thread-safe manner
to be able to exploit this parallelism.


4.1 Migration to SSDAlloc 


We believe that SSDAlloc is suited to the memory- 
intensive portions of server applications with minimal to 
no locality of reference, and that migration should not be 
difficult in most cases — our experience suggests that only 
a small number of data types are responsible for most of 
the memory usage in these applications. The following 
scenarios of migration are possible for such applications 
to embrace SSDAlloc: 


• Replace all calls to malloc with ssd_malloc:
The application would then use the SSD as a log-structured
page store and use the DRAM as a page
cache. The application’s performance would be better
than when using the SSD via unmodified Linux
swap because it would avoid random writes and circumvent
other legacy swap system overheads that
are more clearly quantified in FlashVM [23].

• Replace all malloc calls made to allocate memory-intensive
data structures of the application with
ssd_malloc: The application can then avoid SSDAlloc’s
runtime intervention (copying data between
Page Buffer and RAM object cache) for non-memory-intensive
data structures and can thereby
slightly reduce its CPU utilization.

• Replace all malloc calls made to allocate memory-intensive
data structures of the application with
ssd_oalloc: The application would then use the
SSD as a log-structured object store only for memory-intensive
objects. The application’s performance
would be better than when using the SSD as a log-structured
swap because now the DRAM and the
SSD would be managed at an object granularity.


In our evaluation of SSDAlloc, we tested all the above 
migration scenarios to estimate the methodology that 
provides the maximum benefit for applications in a hy- 
brid DRAM/SSD setting. 


5 Evaluation Results 


In this section we evaluate SSDAlloc using microbenchmarks
and applications built or modified to use SSDAlloc.
We first present microbenchmarks to test the limits
of benefits from using SSDAlloc versus SSD-swap. We 
also examine the performance of memcached (with SS- 
DAlloc and SSD-swap), a popular key-value store used 
in datacenters, where SSDs have been shown to mini- 
mize energy consumption [7]. Later, we benchmark a 
B+Tree index for SSDs, where we replace all calls to 
malloc with ssd_malloc to see the benefits and impact
of an automated migration to SSDAlloc.

Figure 3: Microbenchmark results on a 32GB array of 128 byte objects. In (a), OPP works best (1.8-3.5 times over MP and
2.2-14.5 times over swap); MP and swap take a huge performance hit when write traffic increases. In (b), OPP, on all SSDs, trumps
all other methods by reducing read and write traffic. In (c), OPP has the maximum write efficiency (31.5 times over MP and 1013
times over swap) by writing only dirty objects as opposed to writing full pages containing them.


After that, we compare the performance of systems 
designed to use SSDAlloc to the same system specifically
customized to use the SSD directly, to evaluate the
overhead from SSDAlloc’s runtime. We examine a net- 
work packet cache backend that was built using transpar- 
ent SSDAlloc techniques described in this paper and also 
the non-transparent mechanism described in our work- 
shop paper [8]. We also evaluate the performance of a 
web proxy/WAN accelerator cache index for SSDs intro- 
duced in prior work [9, 8] and similar to the problems 
addressed more recently [6, 14]. Here, we demonstrate 
how using OPP makes efficient use of DRAM while pro- 
viding high performance. 


In all these experiments we evaluate applications
using three different allocation methods: SSD-swap
(via malloc), MP or log-structured SSD-swap (via
ssd_malloc), and OPP (via ssd_oalloc). Our evaluations
use five kinds of SSDs and two types of servers.
The SSDs and some of their performance characteristics 
are shown in Table 3. The two servers we use have a sin- 
gle core 2GHz CPU with 4GB of RAM and a quad-core 
2.4GHz CPU with 16GB of RAM respectively. 


5.1 Microbenchmarks


We examine the performance of random reads and writes
in an SSD-augmented memory by accessing a large array
of 128 byte objects (an array of total size 32GB)
using various SSDs. We further restrict the accessible
RAM in the system to 1.5GB to test out-of-DRAM performance.
We access objects randomly (read or write) 2
million times per test. The array is allocated using three
different methods: SSD-swap (via malloc), MP (via
ssd_malloc), and OPP (via ssd_oalloc). Object Tables
for OPP and MP occupy 1.1GB and 34MB
respectively. Page Buffers are restricted to a size of 25 MB
(it was sufficient to pin a page down while it was being
accessed in an iteration). Remaining memory was used
by the RAM object cache. To exploit the SSD’s parallelism,
we run 8-10 threads that perform the random accesses
in parallel.

               | OPP | MP | SSD-swap
Std Dev (usec) | 66  | 98 | 287

Table 5: Response times show that OPP performs best, since
it can make the best use of the block-level performance of the
SSD whereas MP provides page-level performance. SSD-swap
performs poorly due to worse write behavior.

The results of this microbenchmark are shown in Fig- 
ure 3. Figure 3(a) shows how (for the Intel X25-E SSD) 
allocating objects via OPP achieves much higher per- 
formance. OPP beats MP by a factor of 1.8-3.5 times 
depending on the write percentage and it beats SSD- 
swap by a factor of 2.2-14.5 times. As the write traffic 
increases, MP and SSD-swap fare poorly due to reading/writing
at a page granularity. OPP reads only a 512
byte sector per object access as opposed to reading a 4KB
page; it dirties only 128 bytes as opposed to dirtying 4KB
per random write.

Figure 3(b) demonstrates how OPP performs better 
than all the allocation methods across all the SSDs when 
50% of the operations are writes. OPP beats MP by a 
factor of 1.4—3.5 times and it beats SSD-swap by a factor 
of 5.5-17.4 times. Table 5 presents response time statis- 
tics when using the Intel X25-E SSD. OPP has the lowest 
averages and standard deviations. SSD-swap has a high 
average response time compared to OPP and MP. This is 
mainly because of storage sub-system inefficiencies and 
random writes (quantified more clearly in [23]). 

Figure 4: Memcached results. In (a), OPP outperforms MP and SSD-swap by factors of 1.6 and 5.1 respectively (mix of 4byte to
4KB objects). In (b), SSDAlloc’s use of objects internally can yield dramatic benefits, especially for smaller memcached objects.
In (c), SSDAlloc beats SSD-Swap by a factor of 4.1 to 6.4 for memcached tests (mix of 4byte to 4KB objects).

Figure 3(c) quantifies the write optimization obtained
by using OPP in log scale. OPP writes at an object granularity,
which means that it can fit more dirty
objects in a given write buffer when compared to MP.
When a 128KB write buffer is used, OPP can fit nearly
1024 dirty objects in the write buffer while MP can fit
only around 32 pages containing dirty objects. Hence,
OPP writes more dirty objects to the SSD
per random write when compared to both MP and SSD-swap
(which makes a random write for every dirty object).
OPP writes 1013 times more efficiently compared
to SSD-swap and 31.5 times compared to MP (factors
independent of SSD make). Additionally, OPP not only
increases write efficiency but also writes 31.5 times less
data compared to MP and SSD-swap for the same workload
by working at an object granularity and thereby increases
the SSD lifetime by the same factor.
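
The measured write-efficiency factors sit just below the ideal ratios implied by the buffer and object sizes. The following sketch computes those ideal bounds, assuming every dirty 128 byte object lives on a different 4KB page (a reasonable assumption for a fully random workload, but an assumption nonetheless).

```python
WRITE_BUFFER_BYTES = 128 * 1024
OBJECT_BYTES = 128
PAGE_BYTES = 4096

# Dirty objects carried by one SSD write under each scheme.
opp_objects_per_write = WRITE_BUFFER_BYTES // OBJECT_BYTES   # buffer packed with objects
mp_objects_per_write = WRITE_BUFFER_BYTES // PAGE_BYTES      # one dirty object per buffered page
swap_objects_per_write = 1                                   # one random page write per dirty object

opp_vs_mp = opp_objects_per_write / mp_objects_per_write
opp_vs_swap = opp_objects_per_write / swap_objects_per_write

print(opp_objects_per_write, mp_objects_per_write, opp_vs_mp, opp_vs_swap)
```

The ideal factors are 32 and 1024, against the reported 31.5 and 1013.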

Overall, OPP trumps SSD-swap by huge gain factors. 
It also outperforms MP by large factors providing a good 
insight into the benefits that OPP would provide over log- 
structured swaps. Such benefits scale inversely with the 
size of the object. For example with 1KB objects OPP 
beats MP by a factor of 1.6—2.8 and with 2KB objects 
the factor is 1.4—2.3. 


5.2 Memcached Benchmarks 


To demonstrate the simplicity of SSDAlloc and its per- 
formance benefits for existing applications, we modify 
memcached. Memcached uses a custom slab allocator to 
allocate values and regular mallocs for keys. We replaced
memcached’s slabs with OPP (ssd_oalloc) and
with MP (ssd_malloc) to obtain two different versions.
These changes require modifying 21 lines of code out of 
over 11,000 lines in the program. When using MP, we replaced
malloc with ssd_malloc inside memcached’s
slab allocator (used only for allocating values).

We compare these versions with an unmodified mem- 
cached using SSD-swap. For SSDs with parallelism we 
create multiple swap partitions on the same SSD. We also 
run multiple instances of memcached to exploit CPU and 
SSD parallelism. Figure 4 shows the results. 

Figure 4(a) shows the aggregate throughput obtained 
using a 32GB Intel X25-E SSD (2.5GB RAM), while 
varying the number of memcached instances used. We 
compare five different configurations — memcached with 
OPP and MP, memcached with one, two and three swap 
partitions on the same SSD. For this experiment we pop- 
ulate memcached instances with object sizes distributed 
uniformly at random from 4 bytes to 4KB such that the
total size of objects inserted is 30GB. For benchmarking,
we generate 1 million memcached get and set requests 
(100% hitrate) each using four client machines that stati- 
cally partition the keys and distribute their requests to all 
running memcached instances. 

Results indicate that SSDAlloc’s write aggregation is
able to exploit the device’s parallelism, while SSD-swap 
based memcached is restricted in performance, mainly 
due to the swap’s random write behavior. OPP (at 8 in- 
stances of memcached) beats MP (at 6 instances of mem- 
cached) and SSD-swap (at 6 instances of memcached on 
two swap partitions) by factors of 1.6 and 5.1 respec- 
tively by working at an object granularity, for a mix of 
object sizes from 4bytes to 4KB. While using SSD-Swap 
with two partitions lowers the standard deviation of the 
response time, SSD-Swap had much higher variance in 
general. For SSD-Swap, the average response time was 
667 microseconds and the standard deviation was 398 
microseconds, as opposed to OPP’s response times of 
287 microseconds with a 112 microsecond standard de- 
viation (high variance due to synchronous GC). 

Figure 4(b) shows how object size determines mem- 
cached performance with and without OPP (Intel X25-E 
SSD). Here, we generate requests over the entire work- 
load without much locality. We compare the aggregate 
throughput obtained while varying the maximum ob- 
ject size (actual sizes are distributed uniformly from 128 
bytes to limit). We perform this experiment for three set- 
tings — 1) Eight memcached instances with OPP, 2) Six 
memcached instances with MP and 3) Six memcached 
instances with two swap partitions. We picked the num- 
ber of instances from the best performing numbers ob- 
tained from the previous experiment. We notice that 
as the object size decreases, memcached with OPP per- 
forms much better than when compared to memcached 
with SSD-swap and MP. This is due to the fact that using 
OPP moves objects to/from the SSD, instead of pages, 
resulting in smaller reads and writes. The slight drop in
performance in case of MP and SSD-swap when moving
from 4KB object size limit to 8KB is because the runtime
sometimes issues two reads for objects larger than 4KB.
When the Object Table indicates that they are contiguous
on SSD, we can fetch them together. In comparison,
SSD-swap prefetches when possible.

Figure 5: Packet Cache Benchmarks: In (a) we see that SSDAlloc’s runtime mechanism adds only up to 20 microseconds of
latency overhead, while there was no significant difference in throughput. B+Tree Benchmarks: In (b), we see that SSDAlloc’s
ability to internally use objects beats page-sized operations of MP or SSD-swap.

Figure 4(c) quantifies these gains for various SSDs 
(objects between 4byte and 4KB) at a high insert rate 
of 50%. The benefits of OPP can be anywhere between 
4.1-6.4 times higher than SSD-swap and 1.2-1.5 times 
higher than MP (log-structured swap). For smaller objects
(each 0.5KB) the gains are 1.3-3.2 and 4.9-16.4
times respectively over MP and SSD-swap (16.4 factor
improvement is achieved on the Intel X25-V SSD). Also,
depending on object size distribution, OPP writes any- 
where between 3.88-31.6 times more efficiently when 
compared to MP and 24.71-1007 times compared to 
SSD-swap (objects written per SSD write). The total 
write traffic of OPP is also between 3.88-31.6 times less 
when compared to MP and SSD-swap, increasing the 
lifetime and reliability of the SSD. 


5.3 Packet Cache Benchmarks


Packet caches (and chunk caches) built using SSDs scale 
the performance of network accelerators [6] and inline 
data deduplicators [14] by exploiting good random read 
performance and large capacity of flash. Similar capacity 
DRAM-only systems will cost much more and also con- 
sume more power. We built a packet cache backend that 
indexes a packet with the SHA1 hash of its contents (using
a hash table). We built it via two methods: 1) packets
are allocated via OPP (ssd_oalloc), and 2) packets
are allocated via the non-transparent object get/put based
SSDAlloc that we describe in our workshop paper [8],
where the SSD is used directly without any runtime intervention.
Remaining data structures in both the systems
are allocated via malloc. We compare these two im- 
plementations to estimate the overhead from SSDAlloc’s 
runtime mechanism for each packet accessed. 

For the comparison, we test the response times of 
packet get/put operations into the backend. We consider 
many settings: we vary the size of the packet from 100
to 1500 bytes and in another setting we consider a mix
of packet sizes (uniformly, from 100 to 1500 bytes). We
use a 20 byte SHA1 hash of the packet as the key that is
stored in the hashtable (in DRAM) against the packet as
the value (on SSD); the cache is managed in LRU fashion.
We generate random packet content from /dev/random.
We use the Intel X25-M SSD and the high-end
CPU machine for these experiments, with eight threads 
for exploiting device parallelism. We first fill the SSD 
with 32GB worth of packets and then perform 2 million 
lookups and inserts (after evicting older packets in LRU 
fashion). In this benchmark, we configured the Page 
Buffer to hold only a handful of packets such that ev- 
ery page get/put request leads to a signal raise, and an 
ATM lookup followed by an OPP page materialization. 

Figure 5(a) compares the response times of OPP 
method using the transparent techniques described in 
this paper and the non-transparent calls described in the 
workshop paper [8]. The results indicate that the overhead
from SSDAlloc’s runtime mechanism is only on
the order of ten microseconds, and there is no significant
difference in throughput. The highest overhead observed was
for 100 byte packets, where transparent SSDAlloc con- 
sumed 6.5% more CPU than the custom SSD usage ap- 
proach when running at 38K 100 byte packets per sec- 
ond (30.4 Mbps). We believe this overhead is acceptable 
given the ease of development. We also built the packet 
cache by allocating packets via MP (ssd_malloc) and 
SSD-swap (malloc). We find that OPP based packet 
cache performed 1.3—2.3 times better than an MP based 
one and 4.8-10.1 times better than SSD-swap for mixed 
packets (from 100 to 1500 bytes) across all SSDs. Write 
efficiency of OPP scaled according to the packet size as 
opposed to MP and SSD-swap which always write a full 
page (either for writing a new packet or for editing the 
heap manager data by calling ssd_free or free). Using
an OPP packet cache, three Intel SSDs can accelerate
a 1Gbps link (1500 byte packets at 100% hit rate),
whereas MP and SSD-swap would need 5 and 12 SSDs
respectively.

Figure 6: HashCache benchmarks: SSDAlloc OPP option can
beat MP and SSD-Swap on RAM requirements due to caching
objects instead of pages. The maximum size of a completely
random working set of index entries each allocation method
can cache in DRAM is shown (in log scale).
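
The link-rate figures quoted in the packet cache discussion can be checked with a few lines of arithmetic. The sketch below converts between packet rate and link bandwidth and derives the per-SSD packet rate implied by the stated SSD counts; the per-SSD numbers are an inference from those counts, not measurements reported in the text.

```python
def packets_per_sec(link_bps, packet_bytes):
    """Packet rate needed to fill a link with fixed-size packets."""
    return link_bps / (packet_bytes * 8)


def mbps(pps, packet_bytes):
    """Bandwidth in Mbps generated by pps packets of packet_bytes each."""
    return pps * packet_bytes * 8 / 1e6


# 1Gbps of 1500 byte packets at a 100% hit rate.
full_rate = packets_per_sec(1e9, 1500)

# Packet rate each SSD must sustain, given the SSD counts quoted in the text.
ssd_counts = {"OPP": 3, "MP": 5, "SSD-swap": 12}
per_ssd = {name: full_rate / n for name, n in ssd_counts.items()}

# The CPU-overhead measurement point: 38K packets/s of 100 bytes each.
small_packet_mbps = mbps(38_000, 100)

print(round(full_rate), {k: round(v) for k, v in per_ssd.items()}, small_packet_mbps)
```

A 1Gbps link carries about 83,000 such packets per second, so three SSDs suffice when each sustains roughly 28,000 packet lookups per second.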


5.4 B+Tree Benchmarks 


We built a B+Tree data structure via Boost framework [1] 
using the in-built Boost object_pool allocator (which uses 
malloc internally). We then ported it to SSDAlloc OPP
(in 15 lines of code) by replacing calls to object_pool
with ssd_oalloc. We also ported it to MP by replacing
all calls to malloc (inside object_pool) with
ssd_malloc (in 6 lines of code). Hence, in the MP
version, every access to memory happens via
SSDAlloc’s runtime mechanism.


We use the Intel X25-V SSD (40GB) for the experi- 
ments and restrict the amount of memory in the system 
to 256MB for both the systems to test out-of-DRAM be- 
havior. We allow up to 25 keys stored per inner node and 
25 values stored in the leaf node, and we vary the key 
size. We first populate the B+Tree such that it has 200 
million keys, to make sure that the height of the B+Tree 
is at least 5. We vary the size of the key, so that the size 
of the inner object and leaf node object vary. We perform 
2 million updates (values are updated) and lookups. 
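
The choice of 200 million keys can be related to the tree height with a short calculation. The sketch below is a simplification that treats every node, inner or leaf, as holding exactly `fanout` entries. It is not the Boost-based implementation used in the experiments, and min_levels is a hypothetical helper name.

```python
def min_levels(num_keys, fanout):
    """Smallest number of levels whose leaf capacity covers num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


full_nodes = min_levels(200_000_000, 25)   # every node completely full
half_full = min_levels(200_000_000, 13)    # nodes only about half full

print(full_nodes, half_full)
```

Even with completely full nodes the tree needs six levels, that is, five edges from root to leaf, which is consistent with the stated height of at least 5. Partially filled nodes only make the tree deeper.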


Figure 5(b) shows that MP and OPP provide much 
higher performance than using SSD-swap. As the key 
size increases from 4 to 64 bytes, the size of the nodes 
increases from 216 bytes to 1812 bytes. The performance
of SSD-swap and MP is constant in all cases (with
MP performing 3.8 times better than SSD-swap with log- 
structured writes) because they access a full page for al- 
most every node access, regardless of node size, increas- 
ing the size of the total dirty data, thereby performing 
more erasures on the SSD. OPP, in comparison, makes 
smaller reads when the node size is small and its perfor- 
mance scales with the key size in the B+Tree. We also re- 
port that across SSDs, B+Tree operations via OPP were 
1.4—3.2 times faster when compared to MP and 4.3-12.7 
times faster than when compared to SSD-swap (for a 64 
byte key). In the next evaluation setting, we demonstrate 
how OPP makes the best use of DRAM transparently. 


5.5 HashCache Benchmarks


Our final application benchmark is the efficient Web 
cache/WAN accelerator index based on HashCache [9]. 
HashCache is an efficient hash table representation that 
is devoid of pointers; it is a set-associative cache index 
with an array of sets, each containing the membership 
information of a certain (usually 8-16) number of ele- 
ments currently residing in the cache. We wish to use 
an SSD backed index for performing HTTP caching and 
WAN Acceleration for developing regions. SSD backed 
indexes for WAN accelerators and data deduplicators are 
interesting because only flash can provide the neces- 
sary capacity and performance to store indexes for large 
workloads. A netbook with multiple external USB hard 
drives (up to a terabyte) can act as a caching server [8].
The inbuilt DRAM of 1-2 GB would not be enough to 
index a terabyte hard drive in memory, hence, we pro- 
pose using SSDAlloc in those settings — the internal SSD 
can be used as a RAM supplement which can provide 
the necessary index lookup bandwidth needed for WAN 
Accelerators [16] which make many index lookups per 
HTTP object. 

We create an SSD based HashCache index for 3 bil- 
lion entries using 32GB SSD space. For creating the in- 
dex, HashCache creates a large contiguous array of 128 
byte sets. Each set can hold information for sixteen el- 
ements — hashes for testing membership, LRU usage in- 
formation for cache maintenance and a four byte loca- 
tion of the cached object. We test three configurations 
of HashCache: with OPP (via ssd_oalloc), MP (via 
ssd_malloc) and SSD-swap (via malloc) to create 
the sets. In total, we had to modify 28 lines of code for 
these modifications. While using OPP we made use of 
Checkpointing. This is because we want to be able to 
quickly reboot the cache in case of power outages (netbooks
have batteries and a graceful shutdown is possible
in case of power outages).
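
To make the set-associative layout concrete, the sketch below models an index as an array of sets, each holding up to sixteen (hash, location) entries kept in LRU order. It is a toy illustration only: the class name, the 4-byte tag taken from a SHA-1 digest, and the list-based sets are assumptions, and it does not reproduce HashCache's actual 128 byte set encoding.

```python
import hashlib

WAYS = 16  # elements per set, as described in the text


class SetAssociativeIndex:
    def __init__(self, num_sets):
        # Each set is a short list ordered from most to least recently used.
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, key):
        digest = hashlib.sha1(key).digest()
        set_id = int.from_bytes(digest[:8], "big") % len(self.sets)
        tag = digest[8:12]  # short hash kept for the membership test
        return set_id, tag

    def insert(self, key, location):
        set_id, tag = self._locate(key)
        entries = [e for e in self.sets[set_id] if e[0] != tag]
        entries.insert(0, (tag, location))
        self.sets[set_id] = entries[:WAYS]  # drop the LRU entry beyond 16 ways

    def lookup(self, key):
        set_id, tag = self._locate(key)
        entries = self.sets[set_id]
        for i, (t, location) in enumerate(entries):
            if t == tag:
                entries.insert(0, entries.pop(i))  # refresh the LRU position
                return location
        return None


# A single set makes the eviction behaviour easy to observe.
index = SetAssociativeIndex(num_sets=1)
for i in range(WAYS + 1):
    index.insert(b"url-%d" % i, i)
oldest = index.lookup(b"url-0")     # evicted by the 17th insert
newest = index.lookup(b"url-16")
```

Because a lookup touches exactly one set and follows no pointers, an allocator that caches at set granularity (as OPP does) brings in only the 128 bytes that the lookup needs.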

Figure 6(a) shows, in log scale, the maximum number 
of useful index entries of a web workload (highly ran- 
dom) that can reside in RAM for each allocation method. 
With available DRAM varying from 2GB to 4.5GB, we 
show how OPP uses DRAM more efficiently than MP 
and SSD-swap. Even though OPP’s Object Table uses 
almost 1GB more DRAM than MP’s Object Table, OPP 
is still able to hold a much larger working set of index entries.
This is because OPP caches at set granularity while
MP caches at a page granularity, and HashCache has al- 
most no locality. Being able to hold the entire working 
set in memory is very important for the performance of 
a cache, since it not only saves write traffic but also im- 
proves the index response time. 

We now present some reboot and recovery time mea- 
surements. Rebooting the version of HashCache built 
with OPP Checkpointing for a 32GB index (1.1GB Object
Table) took 17.66 sec for the Kingston SSD (which
has a sequential read speed of 70 MBPS).

We also report performance improvements from us- 
ing OPP over MP and SSD-swap across SSDs. For 
SSDs with parallelism, we partition the index horizon- 
tally across multiple threads. The main observation is 
that using MP or SSD-swap would not only reduce performance
but also undermine reliability by writing
more often and writing more data to the SSD. OPP’s performance
is 5.3-17.1 times higher than when using SSD-Swap,
and 1.3-3.3 times higher than when using MP
across SSDs (50% insert rate).


6 Conclusion 


SSDAlloc provides a hybrid memory management system
that allows new and existing applications to easily
use SSDs to extend the RAM in a system, while perform- 
ing up to 17 times better than SSD-swap, up to 3.5 times 
better than log-structured SSD-swap and increasing the 
SSD’s lifetime by a factor of up to 30 times with mini- 
mal code changes, limited to the memory allocation part 
of the application code. The performance of SSDAlloc
based applications is close to that of custom-developed 
SSD applications. We demonstrate the benefits of SS- 
DAlloc in a variety of contexts — a data center application 
(memcached), a B+Tree index, a packet cache backend 
and an efficient hashtable representation (HashCache), 
which required only minimal code changes, little application
knowledge, and no expertise with the inner workings
of SSDs.


Abstract


Current approaches to model checking distributed sys- 
tems reduce the problem to that of model checking cen- 
tralized systems: global states involving all nodes and 
communication links are systematically explored. The 
frequent changes in the network element of the global 
states lead however to a rapid state explosion and make 
it impossible to model check any non-trivial distributed 
system. We explore in this paper an alternative: a local 
approach where the network is ignored, a priori: only the 
local nodes’ states are explored and in a separate man- 
ner. The set of valid system states is a subset of all com- 
binations of the node local states and checking validity 
of such a combination is only performed a posteriori, 
in case of a possible bug. This approach drastically re- 
duces the number of transitions executed by the model 
checker. It takes for example the classic global approach 
several minutes to explore the interleaving of messages 
in the celebrated Paxos distributed protocol even consid- 
ering only three nodes and a single proposal. Our local 
approach explores the entire system state in a few seconds.
Our local approach clearly does not eliminate the
exponential state explosion problem. Yet, it postpones
its manifestations till some deeper levels. This is already
good enough for online testing tools that restart the
model checker periodically from the current live state of 
a running system. We show for instance how this ap- 
proach enables us to find two bugs in variants of Paxos. 


1 Introduction 


Figure 1: State transition in model checking distributed
systems. In (a) the classic global approach, the model
checker creates the entire state space of the global states,
whereas in (b) our proposed local approach, the network
element is eliminated from the stored states and the
model checker keeps track of only node local states.

At each step of model checking a centralized system, (i)
one of the traversed states is selected, (ii) an enabled
event is executed on that state, and (iii) the resulting
state is added to the list of traversed states. The
user-specified invariants are checked against the traversed
states after each step, and the set of these states grows
exponentially with the depth of the exploration, i.e., the
length of the sequence of enabled events considered.
Current approaches to model checking distributed sys- 
tems [7, 8, 18, 19, 14] reduce the problem to that of 
model checking a centralized system (Figure 1). The sets 
explored are global states comprising the local states of
the nodes involved in the distributed system, i.e., the
system state, as well as the network state involving the
exchange of messages.

The exponential state space explosion problem man- 
ifests itself very quickly in this global approach, which 
makes the model checking of distributed systems practically
ineffective. This is because the global state changes
following any small change to a node local state or the
network state. Consider for instance the celebrated Paxos
protocol [9], in the simple setting with three nodes where
exactly one proposes at the start, i.e., no contention: it
takes the global model checking approach 1514 s (run- 
ning on a 3.00 GHz Intel(R) Pentium(R) 4 CPU with 1 
MB of L2 cache) to explore the interleaving of messages. 

The starting point of this paper is a couple of simple, 
complementary observations: (1) in the global model 
checking approach, the invariants are checked on each 
traversed global state, although these invariants are
typically specified only on the system states, i.e., the
invariants do not involve the network states [8, 18, 19, 14];¹
(2) for checking invariants that are defined on system 
variables, visiting the system part is a priori sufficient. 
Focusing on these states only, and ignoring the net- 
work states, significantly reduces the exploration space 
in comparison to the classic approach where each sys- 
tem state is typically repeated in multiple global states 
that differ only in the network part. 

We present in this paper a local model checking ap- 
proach, which essentially consists in keeping track of the 
traversed local nodes’ states separately by ignoring the 
network, a priori. Combined, these states are sufficient 
for invariant checking. The approach is most effective on
protocols that involve frequent changes to the network,
i.e., the nodes have lots of parallel network activities. For
the Paxos example state space with one proposal, our ap- 
proach explores the entire system state in a few seconds. 
We show that our approach is complete in the sense that 
any violation of a system state invariant that could be de- 
tected by the global approach could be detected by our 
local approach. Two important remarks are however in 
order. 

First, the combination of node states does not induce 
system states that are all valid: the fact that we ignore 
the network element, a priori, means that some combi- 
nations of node states might not occur in a real run. In 
other words, although complete, checking invariants on 
the retrieved system states is unsound since it could re- 
port a violation on an invalid system state. We address 
this problem by, a posteriori, verifying every preliminary 
violation report to make sure the sequence of events lead- 
ing to the corresponding system state could also happen 
in a real run. An invariant violation is then reported to the
user only if it passes this test. If the number of preliminary
violations is low enough, which turns out to be the case 
in our experiments with Paxos, the performance penalty 
of verifying them becomes negligible. 

Second, although our local approach is several orders 
of magnitude faster than the classic model checking ap- 
proach, the state explosion problem is not eliminated. 
(The cost of invalid states created by our approach, al- 
though low at the start, will anyway eventually domi- 
nate in the general case.) Yet, we believe this can, to 
a large extent, be addressed by online model checking 
tools where the model checker is run for just a short pe- 
riod (a few seconds): in this case, our approach is effi- 
cient enough to search till depths of 20~30 for the Paxos 
example state space. 


¹ In testing, invariants are used to express the high-level properties
of the system. Including the in-flight messages in invariants, although 
possible in theory, makes defining the invariants too complicated in 
practice. 
In the global model checking approach, visiting the system
states is part of the exploration process: the new
global states (which involve the system states) are ex- 
plored by running enabled events on the previously vis- 
ited global states. Therefore, skipping a system state 
makes the exploration incomplete. In contrast, our lo- 
cal approach separates the exploration of transitions from 
the creation of system states. This makes it possible to 
ignore all system states on which the user-specified in- 
variants can inherently not be violated: for instance, the 
Paxos invariant stipulates that no two decisions should 
be different and all undecided states can systematically 
be eliminated. 

Summary of Contributions. 


• We introduce a new, local approach to model checking
distributed systems. Instead of keeping track
of global states, we eliminate the network element
from the model checking states and keep track only
of node local states. Our approach optimistically
eliminates the overhead of ensuring soundness of
every visited state and instead verifies soundness
only on the states that violate the invariants.

• Our approach decouples the exploration algorithm from
system state space creation. This feature opens the
door for optimizations that skip some system states
without, however, hurting the completeness of
exploration. We benefit from this aspect in our
experiments by skipping the system states that could not
violate the Paxos invariant.

• Having the exploration, system state creation, and
soundness verification decoupled, the model checking
process can be embarrassingly parallelized to
benefit from the ever increasing number of cores.

• We present an efficient implementation of our
approach and we show how this approach tracks bugs
in two variants of Paxos, known to be one of the
most complex distributed algorithms.


The rest of the paper is organized as follows. § 2
illustrates our approach through a simple example. The
background is recalled in § 3. § 4 presents our approach.
After presenting the evaluation results in § 5, we contrast
the local model checking approach with related work in § 6
and conclude the paper in § 7.


2 Local Model Checking: A Primer 


Here we use a simple example to highlight the difference
between global model checking and our local approach.
The example we consider here does not attempt to
illustrate the performance improvements obtained by our
approach but aims at explaining the main idea. The
example system is a simple distributed tree structure, depicted
in Figure 2. Node 0 initiates a message for Node 4 and
changes its state to sent. Each node, upon receiving a
message, forwards it to its children. Node 4 changes its
state to received upon receiving the message.

Figure 2: A simple distributed tree algorithm. Each node
forwards the message to its children.

Figure 3: The global state space of the example tree in
Figure 2 as explored by a global model checking approach.
The network element of the global state is represented
by the set of in-flight messages. Each arrow
depicts a transition in the model checker from one global
state to another. The label beside each arrow indicates
the event that triggers the transition. Although the global
states inside the rectangles are duplicates, they are not
joined into one state, for simplicity of presentation.

At each step of global model checking, the model 
checker transitions from a global state to another by run- 
ning an enabled event, such as handling a message. The 
global state contains the network state besides the system 
state, i.e., the local state of all the nodes. The global state 
space of the example system is depicted in Figure 3. The 
initial state of each node is denoted "-". The system state
is shown by concatenating the five states of Nodes 0 to 4.
The states of Nodes 0 and 4 change to "s" and "r" after
the sending and receiving of the message, respectively.
Each change to the network element creates
a new global state. As one can observe, the number of
system states covered by this global state space is much
smaller than its size.

Figure 4 illustrates our local approach on the same
example system. Here, the network element, i.e., the
non-essential part for invariant checking, is separated from
the model checking state. Instead, we keep a shared
network component that receives the messages generated by
all the transitions in the model checking. Observe that
the messages added to the network are not removed by
the executed transitions. This is necessary for the
completeness of the search, because each message must be
received by all the states of the destination node,
including the node states that will be explored later.

Figure 4: Local model checking approach on the example
tree in Figure 2. The first column indicates the
changes to the shared network element. The middle
column shows the set of states of Nodes 0 to 4. The first
event is the local event of Node 0 that generates the
message. The generated message is then added to the shared
network element. At each step, an event is selected and
is executed on all states of the destination node. The
resultant states are added to the list of visited node states if
they have not been visited before. The last column shows
the new system states created after each step.

The last column of the figure depicts the new system 
states created after each step. The system states are cre- 
ated temporarily for the sake of being checked against 
the user-specified invariants. Observe that, in total, only
4 system states are created, in contrast with the 12 global
states of Figure 3. Moreover, the last system state, i.e.,
"----r", is invalid since Node 4 could not receive the
message before it is sent by Node 0. After an invariant is
violated on a system state, we run a soundness verifica- 
tion phase to ensure the validity of the system state. 
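
The contrast between the two explorations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree topology in `CHILDREN` is hypothetical (the exact shape of Figure 2 is not reproduced here), in-flight messages are modelled as a set of destination nodes, and all function names are ours.

```python
from itertools import product

# Hypothetical topology (an assumption; Figure 2 is not reproduced here).
CHILDREN = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
NODES = sorted(CHILDREN)

def deliver(node, state):
    """Message handler: forward to children; Node 4 also records receipt."""
    return ("r" if node == 4 else state), CHILDREN[node]

def explore_global():
    """Global approach: each state is (system state, in-flight messages)."""
    initial = (("-",) * 5, frozenset())
    after_send = (("s", "-", "-", "-", "-"), frozenset(CHILDREN[0]))
    seen, frontier = {initial, after_send}, [after_send]
    while frontier:
        system, net = frontier.pop()
        for dest in net:
            local, sent = deliver(dest, system[dest])
            nxt = (system[:dest] + (local,) + system[dest + 1:],
                   (net - {dest}) | frozenset(sent))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def explore_local():
    """Local approach: per-node state sets and one shared, grow-only network."""
    states = {n: {"-"} for n in NODES}
    states[0].add("s")               # local event of Node 0
    network = set(CHILDREN[0])       # messages are never removed
    changed = True
    while changed:
        changed = False
        for dest in sorted(network):
            for state in sorted(states[dest]):
                local, sent = deliver(dest, state)
                if local not in states[dest] or not set(sent) <= network:
                    changed = True
                states[dest].add(local)
                network.update(sent)
    return states

def system_states(states):
    """All combinations of node-local states, valid or not."""
    return {"".join(combo) for combo in product(*(states[n] for n in NODES))}
```

With this assumed topology, the global search stores 10 global states that cover only 3 distinct system states, while the local search keeps 7 node states whose combinations yield 4 system states, including the invalid "----r" that soundness verification would have to reject.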


3 Preliminaries 


We present here a simple model of a distributed system
and a basic model checking algorithm based on depth-first
search. The model is later altered in § 4 to explain the
local model checking algorithm. We then explain the short
run in online model checking, where the model checker
can benefit from our local model checking approach.

basic notions:
  $N$: node identifiers
  $S$: node states
  $M$: message contents
  $N \times M$: (destination process, message)-pair
  $C = 2^{N \times M}$: set of messages with destination
  $A$: internal node actions (timers, application calls)

global state: $(L, I) \in G$, $G = 2^{N \times S} \times 2^{N \times M}$
  system state (local nodes' states): $L \subseteq N \times S$ (function from $N$ to $S$)
  in-flight messages (network): $I \subseteq N \times M$

behavior functions for each node:
  message handler: $H_M \subseteq (S \times M) \times (S \times C)$
  internal action handler: $H_A \subseteq (S \times A) \times (S \times C)$

transition function for distributed system:
  node message handler execution:
    $((s_1, m), (s_2, c)) \in H_M$
    before: $(L_0 \uplus \{(n, s_1)\}, I_0 \uplus \{(n, m)\}) \leadsto$
    after: $(L_0 \uplus \{(n, s_2)\}, I_0 \uplus c)$
  internal node action (timer, application calls):
    $((s_1, a), (s_2, c)) \in H_A$
    before: $(L_0 \uplus \{(n, s_1)\}, I) \leadsto$
    after: $(L_0 \uplus \{(n, s_2)\}, I \uplus c)$

Figure 5: A simple distributed system model


3.1 System Model 


Figure 5 describes a simple model of a distributed sys- 
tem, taken from [18]. 

System state. The global state of the entire distributed
system encompasses (1) the system state, i.e., local states
of all nodes, and (2) in-flight network messages. We
assume a finite set of node identifiers $N$ (e.g.,
corresponding to IP addresses). Each node $n \in N$ has a local state
$L^n \in S$. A node state encompasses node-local
information, such as explicit state variables of the distributed
node implementation, the status of timers, and the state
that determines application calls. A network state
corresponds to the set of in-flight messages, $I$. We represent
each in-flight message by a pair $(N, M)$ where $N$ is the
destination node of the message and $M$ is the remaining
message content (including sender node information and
message body).

Node behavior. Each node in the system runs the
same state-machine implementation. The state machine
has two kinds of handlers: (i) a message handler
executes in response to a network message; (ii) an internal
handler executes in response to a node-local event
such as a timer and an application call. We represent
message handlers by a set of tuples $H_M$. The condition
$((s_1, m), (s_2, c)) \in H_M$ means that, if a node is in state
$s_1$ and it receives a message $m$, then it transitions into
state $s_2$ and sends the set $c$ of messages. Each element
$(n', m') \in c$ is a message with target destination node $n'$
and content $m'$. $((s_1, a), (s_2, c)) \in H_A$ represents the
handling of an internal node action $a \in A$. An internal
node action handler is analogous to a message handler,
but it does not consume a network message.

System behavior. The behavior of the system specifies
one step of a transition from one global state $(L, I)$ to
another global state $(L', I')$. We denote this transition by
$(L, I) \leadsto (L', I')$ and describe it in Figure 5 in terms of
handlers $H_M$ and $H_A$.² The handler that sends the
message inserts it directly into the network state
$I$, whereas the handler receiving the message simply
removes it from $I$. To keep the model simple, we assume
that transport errors are particular messages, generated
and processed by message handlers.

Observations. The following observations can be
derived from the definitions of $H_M$ and $H_A$ in Figure 5:
(i) Except for the node in which the event is executed, the
state of the other nodes, i.e., $L_0$, is untouched. This implies
that to execute an event on node $n$, we require only the
state of node $n$; (ii) To execute $H_M$ with message $m$ on
node $n$, the only required part from the network state is
tuple $(n, m)$: the rest of the network state, i.e., $I_0$, is
untouched. These observations indicate that the entire
global state of the system is not required to execute a
handler in the model checker.
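
As a minimal sketch of the message-handler transition of Figure 5, the following Python function applies one step to a global state. The names `step_message` and `relay` are hypothetical, node states and message contents are plain strings, and ordinary set union stands in for the disjoint union of the figure.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Message = Tuple[int, str]                      # (destination node, content)
Handler = Callable[[str, str], Tuple[str, FrozenSet[Message]]]

def step_message(L: Dict[int, str], I: FrozenSet[Message],
                 n: int, m: str, handler: Handler):
    """One global transition for a message handler, in the style of Figure 5."""
    assert (n, m) in I, "the message must be in flight"
    s2, c = handler(L[n], m)       # needs only node n's state and (n, m)
    L2 = dict(L)
    L2[n] = s2                     # every other node state is untouched
    return L2, (I - {(n, m)}) | c  # consume (n, m), add the sent messages

def relay(state: str, m: str):
    """Hypothetical handler: mark the node and send one message to node 2."""
    return "got:" + m, frozenset({(2, m)})
```

Note that the handler itself never sees the other node states or the rest of the network; only the bookkeeping around it touches the global state, which is what the two observations rely on.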


3.2 Global Model Checking 


Global model checking is based on a standard search al- 
gorithm such as bounded depth-first search (B-DFS) for 
tracking invariant violations in the transition system
captured by relation $\leadsto$ of Figure 5. The search starts from a
given global state, which, in the standard approach, is the
initial state of the system. By executing enabled handlers
($H_M$ and $H_A$) on the traversed global states, the search
systematically explores reachable global states at larger 
and larger depths and checks whether the states satisfy 
the given invariant condition. 

Soundness. B-DFS is sound in the sense that all
violation reports could also occur in a real run of the
system. In other words, there is no false positive in the
reported bugs. Moreover, all traversed states are valid and
could also be created in a real run. For soundness,
however, it is sufficient that the violations reported
to the developer are valid. We will show later that our local model
checking is also sound, even though some system states
created a priori might be invalid.

² $\uplus$ in the handler definition means disjoint union.

Figure 6: The covered state space in model checking by
(a) a model checker started from the initial global state,
and (b) an online model checker that restarts periodically
from the current live system state. The curved line
represents the states explored by the running system.

Completeness. An exploration algorithm is complete if, 
given enough time and space, it can explore all system 
states. In other words, completeness is satisfied if there 
is no false negative in bug reporting. Although B-DFS 
is complete, due to an inherently limited time budget, in 
practice it can explore only a small fraction of the state 
space of complex algorithms. 
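
A bounded depth-first search of this kind can be sketched as follows. This is a generic illustration, not MaceMC's implementation: `successors` and `invariant` are hypothetical callables supplied by the caller, and states are assumed to be hashable.

```python
import math

def bounded_dfs(initial, successors, invariant, max_depth):
    """Return the states within max_depth of `initial` that violate `invariant`."""
    best_depth = {}               # shallowest depth at which a state was seen
    violations = []
    stack = [(initial, 0)]
    while stack:
        state, depth = stack.pop()
        if best_depth.get(state, math.inf) <= depth:
            continue              # already expanded from an equal or shallower depth
        first_visit = state not in best_depth
        best_depth[state] = depth
        if first_visit and not invariant(state):
            violations.append(state)
        if depth < max_depth:
            for nxt in successors(state):
                stack.append((nxt, depth + 1))
    return violations
```

Recording the shallowest depth per state, rather than a plain visited set, keeps the bounded search from pruning a state that was first reached near the depth limit and later reached by a shorter path.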


3.3 Online Model Checking


Due to the state space explosion problem, a model
checker of a distributed system cannot explore beyond
a certain depth within a limited time budget. For
example, even in the very small state space experiment of
Figure 10, where only one node proposes once, the model
checker cannot explore more than 15 events within a 
minute. An online model checker is, on the other hand, 
restarted periodically from the live state of a running sys- 
tem. As a consequence, the model checker has a chance 
to explore more relevant states at deeper levels, instead 
of getting stuck in the exponential explosion problem at 
some very shallow depths. 

Figure 6 illustrates the use of a model checker in par- 
allel with a running system. As one can see, an online 
model checker does not require solving the exponential 
explosion problem completely; it is rather sufficient to 
explore till a depth that is useful for testing purposes. 


4 Local Model Checking 


The architecture of our local model checking approach is
depicted in Figure 7. In this approach, the model checker
keeps track of node states separately: set $LS^n$ contains
all the traversed states of node $n$. This is enough to
recreate the system states upon which the invariants are
checked. After a preliminary violation report on a
system state, the validity of the system state is checked by
a soundness verification module. If the system state is
confirmed to be valid, the error is then (and only then)
reported to the developer.

Figure 7: In our local approach, the handler execution
works only on node states and produces new node states.
Local and system states are denoted "LS" and "SS",
respectively. The messages are not removed from the
shared network component after execution. The soundness
verification checks the validity of a system state,
only after an invariant violation is reported.

node message handler execution:
  $((s_1, m), (s_2, c)) \in H_M$
  before: $(L_0 \uplus \{(n, s_1)\}, I^+ \uplus \{(n, m)\}) \leadsto$
  after: $(L_0 \uplus \{(n, s_2)\}, I^+ \uplus \{(n, m)\} \uplus c)$

internal node action (timer, application calls):
  $((s_1, a), (s_2, c)) \in H_A$
  before: $(L_0 \uplus \{(n, s_1)\}, I^+) \leadsto$
  after: $(L_0 \uplus \{(n, s_2)\}, I^+ \uplus c)$

Figure 8: The altered handlers in local model checking.


Instead of keeping a separate network state for each
global state, we keep one single network state $I^+$ that
contains all messages generated during the model
checking (Figure 7). The execution of handlers must change to
work with the shared network state $I^+$ (Figure 8). In the
new handlers, $H'_M$ and $H'_A$, the network state of the
input global state is replaced with the new shared network
state, $I^+$. Furthermore, the received message, $(n, m)$,
is not removed from $I^+$ after the execution of handler
$H'_M$. In other words, the content of $I^+$ is always
increasing. It is not hard to see that the altered handlers
preserve the completeness of the search: for each transition
$(L_p, I_p) \leadsto (L_q, I_q)$ in $H_M$, there exists a corresponding
transition $(L_p, I^+) \leadsto (L_q, I^+)$ in $H'_M$. We discuss
soundness later in this section.
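
The effect of the altered message handler can be sketched as follows. The function name and the data layout (a dict of per-node state sets and one shared set of `(destination, message)` pairs) are assumptions made for illustration only.

```python
def local_step_message(node_states, shared_net, n, m, handler):
    """Run the handler for message (n, m) on every visited state of node n.

    Returns the set of states of node n that were newly discovered.
    The message (n, m) is deliberately left in the shared network.
    """
    assert (n, m) in shared_net
    discovered = set()
    for s1 in list(node_states[n]):
        s2, sent = handler(s1, m)
        shared_net.update(sent)          # new messages join the shared network
        if s2 not in node_states[n]:
            discovered.add(s2)
    node_states[n] |= discovered
    return discovered
```

Because the message stays in the shared network, calling the function again delivers the same message to the states discovered in the meantime, which is the property that preserves completeness.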

Recall from § 3 that, to execute a handler on node $n$,
the only required state is the state of node $n$, i.e., $LS^n$.
Therefore, the stored node states are enough to execute
the handlers and we do not need to recreate the system
state for that. To execute network handlers, however, we
also require message $(n, m)$ from the network (we do not
need the whole network state). As shown in Figure 7, the
handler execution module receives input only from node
states and the shared network module.

 1  proc findBugs(liveState, invariant)
 2    LS = emptySet(); $I^+$ = emptySet();
 3    foreach $n \in N$
 4      $LS^n = LS^n \cup \{liveState^n\}$;
 5    while ( !StopCriterion )
 6      if ($\exists ((s, e), (s', c)) \in H_M$ where $s \in LS^n$, $(n, e) \in I^+$
 7          || $\exists ((s, e), (s', c)) \in H_A$ where $s \in LS^n$)
 8        addNextState(n, s, s', e, c, LS);
 9        checkSystemInvariant(n, s', liveState, LS, invariant);
10
11  proc addNextState(n, s, s', e, c, LS)
12    $I^+ = I^+ \cup c$;
13    $LS^n = LS^n \cup \{s'\}$;
14    s'.predecessors.add(s, e);
15
16  proc checkSystemInvariant(n, s', liveState, LS, invariant)
17    foreach ss : system state
18        where $\forall n_x.\ ss^{n_x} \in LS^{n_x}$
19      if ( !invariant(ss) )
20        if ( isStateSound(liveState, ss) )
21          reportBug(ss); // a bug found
22
23  proc isStateSound(liveState, state)
24    // obtain all sequences following predecessor pointers
25    foreach h : list of event sequences where
26        $h^n \in (state^n.predecessors)^*$ // * is closure operator
27      if ( isSequenceValid(liveState, h) )
28        return true;
29    return false;
30
31  proc isSequenceValid(liveState, h)
32    state = liveState;
33    while ($\exists n$, nextState where $state \xrightarrow{h^n.first} nextState$)
34      state = nextState;
35      $h^n$.popFirst();
36    return h == ();

Figure 9: Local model checker algorithm.


4.1 Algorithm 


Figure 9 presents our algorithm. Variable LS in Figure 9
refers to the set of all visited node states, i.e., $(n, s)$, where
$n$ is the node index and $s$ is the node state. Procedure
findBugs takes the live state of the system as input, to
initialize Variable LS at Lines 3-4. As in global model
checking, the search terminates upon exceeding some
bounds, such as running time or search depth (Line 5).

Handler execution. At each step of the model
checking, an enabled handler, either network or local, is
executed. For network handlers, the algorithm at each step
checks all network messages in Variable $I^+$. To obtain
the enabled network events, for each message $e$ of node
$n$ in $I^+$, all the currently visited states of node $n$ are
considered (Line 6). The corresponding network handler is
then executed (Line 8) and Procedure addNextState
is called on the resultant state, $s'$, and the set of new
network messages, $c$. Note that the messages that are added
to network $I^+$ in this round of the loop (i.e., $c$ in
Figure 8) will be considered on the node states in the next
round.

As in the global model checking approach, the node 
local events, such as timers and application calls, are
defined based on the node local states. In other words, the
value of a node state in $LS^n$ determines which of the local
events are enabled. To obtain the enabled local events,
we look at all visited node states and retrieve their local 
events (Line 7). 

In Procedure addNextState, the set of new
network messages is added to the shared network, $I^+$
(Line 12). If the state of node $n$ has changed, it is added
to set LS (Line 13). Variable predecessors keeps track
of all the last immediate node states as well as the
executed events on them that led to the current node state
(Line 14). We need more than one pointer in Variable
predecessors, since the same node state might be
reached by executing different sequences of events.

Creating system states. The invariants are defined
on system states. Since we do not store the system 
states, they must be temporarily created for the sake 
of invariant checking, which is performed by Procedure 
checkSystemInvariant. The procedure is called 
after each change to LS. Each system state ss is created 
by combining the node states of different nodes in LS. 
(We will explain in § 4.2 an optimization that prevents 
revisiting system states.) 
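
The exploration loop and the naive system state creation described above can be sketched in Python as follows. This is an illustration under our own assumptions, not the LMC implementation: `local_events(n, s)` and `handler(n, s, e)` are hypothetical callables, a received message is encoded as the event `("recv", m)`, node states are assumed hashable and orderable, and the stop criterion is a bound on the number of rounds. Violations are only collected as preliminary suspects; soundness verification is left out.

```python
from itertools import product

def find_bugs(live_state, local_events, handler, invariant, max_rounds=10):
    """Sketch of findBugs: returns (LS, predecessors, preliminary violations)."""
    nodes = sorted(live_state)
    LS = {n: {live_state[n]} for n in nodes}     # visited states per node
    preds = {}            # (node, state) -> set of (previous state, event)
    net = set()           # shared network: (destination, message), grows only
    checked, suspects = set(), []

    def check_system_invariant():
        # Temporarily build every combination of visited node states.
        for combo in product(*(sorted(LS[n]) for n in nodes)):
            if combo not in checked:
                checked.add(combo)
                ss = dict(zip(nodes, combo))
                if not invariant(ss):
                    suspects.append(ss)   # still needs soundness verification

    check_system_invariant()
    for _ in range(max_rounds):                  # stop criterion
        enabled = [(n, s, e) for n in nodes for s in sorted(LS[n])
                   for e in local_events(n, s)]
        enabled += [(n, s, ("recv", m)) for (n, m) in sorted(net)
                    for s in sorted(LS[n])]
        progress = False
        for n, s, e in enabled:
            s2, sent = handler(n, s, e)
            if not set(sent) <= net:
                net.update(sent)                 # like Line 12 of Figure 9
                progress = True
            preds.setdefault((n, s2), set()).add((s, e))   # like Line 14
            if s2 not in LS[n]:
                LS[n].add(s2)                    # like Line 13
                progress = True
                check_system_invariant()
        if not progress:
            break
    return LS, preds, suspects
```

The `checked` set here is a simple stand-in for avoiding repeated invariant checks on the same combination; it trades memory for simplicity and is not how the prototype avoids revisiting system states.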

The only purpose of system state creation is to verify
the user-specified invariant $\mathit{in}$ on them. Therefore, we
can design invariant-specific system state creation to by- 
pass the system states that could not possibly violate the 
invariant. In other words, if $\mathit{in}' \Leftarrow \mathit{in}$ and $\mathit{in}'(ss)$ is
false, verifying $\mathit{in}(ss)$ is not necessary. In order for this
to be useful, $\mathit{in}'$ should be cheaply verifiable. One way
to achieve that is to decompose $\mathit{in}'$ into some locally
verifiable properties. For example, the Paxos invariant
specifies that no two nodes should choose different values. In
system state creation, therefore, we can ignore the node
states in which no value is chosen yet. If the invariant is
defined on node states separately, the invariant-specific
system state creation can also bypass the system states
in which none of the node states have violated the invariant.
For example, in the RandTree distributed tree structure, one
invariant specifies that in all node states the children and
siblings must be disjoint sets.
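
For an invariant of the form "no two nodes choose different values", the invariant-specific creation can be sketched as a search for conflicting pairs of node states instead of a full product. The function name and the `chosen_value` accessor are hypothetical; the accessor is assumed to return `None` for a node state that has not chosen yet.

```python
from itertools import combinations

def conflicting_pairs(LS, chosen_value):
    """Yield pairs of states of two different nodes that chose different values.

    LS maps node -> iterable of node states; chosen_value(state) returns the
    chosen value, or None if no value has been chosen in that state.
    """
    decided = [(n, s, chosen_value(s)) for n in sorted(LS) for s in LS[n]
               if chosen_value(s) is not None]
    for (n1, s1, v1), (n2, s2, v2) in combinations(decided, 2):
        if n1 != n2 and v1 != v2:
            yield (n1, s1), (n2, s2)
```

Each yielded pair identifies a family of candidate system states; the states of the remaining nodes still have to be filled in before soundness verification can be attempted on a concrete system state.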

Soundness verification. Since taking all combinations
of node states could result in some invalid system
states, the preliminary violation of an invariant could be
unsound. Procedure isStateSound, therefore, verifies
the validity of the system state upon which an
invariant is violated. Variable predecessors in each node
state $s'$ contains all the last immediate node states that
led to $s'$. Following these pointers, we obtain the set
of event sequences that could lead to $s'$. If a system
state is valid, then there exists at least one valid
combination of its node states' event sequences.³ Lines 25-26
loop on all these combinations and invoke Procedure
isSequenceValid on each. The number of paths
could increase exponentially with sequence size, which
is the major cost in soundness verification.

Procedure isSequenceValid receives $n$ event
sequences ($h^i$, $i \in N$) corresponding to the $n$ nodes in the
system. The procedure then looks for a valid total order
for execution of the events, in which an event is executed
only after it is enabled. For example, to execute a
network handler that receives message $m$ from node $s$, the
message must first be generated by an event in $s$. At each
step, the procedure verifies whether any of the events on
top of the $h^i$ stacks are enabled (Line 33). The first
enabled event is greedily selected for execution based on
the definition of handlers in Figure 5 (the events are
executed similarly to a real run of the distributed system).
The loop continues until there are no enabled events on
top of the $h^i$ stacks. Afterward, the fact that $h$ is empty
(Line 36) indicates that the set of sequenced events in $h$
was possible to run and hence its corresponding system
state is valid.

Procedure isSequenceValid returns true if and
only if the corresponding input system state is valid. The
proof of the above statement is covered in the technical
report [16]. Intuitively, since an event is not popped out
from $h$ unless it is a valid, enabled event, the
feasibility of executing all events implies that the system state is
valid. It actually does not matter which enabled event is
selected for the next step, since the order demanded by
the sequences will eventually be enforced by receiving
only the messages that are already generated.
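
The greedy replay can be sketched in Python as follows. The event encoding is an assumption made for illustration: an event is either `("local", action)` or `("recv", message)`, local events are treated as always enabled, and `handler(n, s, e)` is a hypothetical deterministic function returning the new state and the list of `(destination, message)` pairs sent.

```python
from collections import Counter

def is_sequence_valid(live_state, h, handler):
    """Greedy replay of per-node event sequences, in the spirit of isSequenceValid.

    live_state: dict node -> state; h: dict node -> list of events.
    """
    state = dict(live_state)
    pending = {n: list(events) for n, events in h.items()}
    in_flight = Counter()              # as in a real run, messages are consumed

    def enabled(n):
        if not pending[n]:
            return False
        kind, payload = pending[n][0]
        return kind == "local" or in_flight[(n, payload)] > 0

    while True:
        ready = [n for n in pending if enabled(n)]
        if not ready:
            break
        n = ready[0]                   # any enabled event may be chosen
        event = pending[n].pop(0)
        kind, payload = event
        if kind == "recv":
            in_flight[(n, payload)] -= 1
        state[n], sent = handler(n, state[n], event)
        in_flight.update(sent)
    return all(not events for events in pending.values())
```

For the tree example of § 2, a sequence in which Node 4 receives the message while Node 0 has no sending event never becomes enabled, so the replay stops with a non-empty sequence and the system state "----r" is rejected.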


4.2 Implementation Details 


Local model checking can be used for testing programs
in all languages, including C++. Basically, any existing
stateful global model checking tool could be
instrumented to run our proposed algorithm. Our prototype
implementation of the local model checking approach,
denoted LMC, uses MaceMC [8], a model checker for
distributed system implementations in the Mace
language [7]. Mace programs are basically structured C++
implementations, in which the boundary of handlers and
the protocol messages need to be specified. This helps
Mace automatically generate the code for serialization
and deserialization of the protocol state, and simplifies
the definition of events in the model checker.

³ Each event sequence must deterministically lead to the same node
state. If the event handler implementation is dependent on some
non-deterministic values, those values must be recorded as part of the event,
to be replayed deterministically on a re-execution of the event.

We use CrystalBall [18, 17] for online running of the 
model checker, in parallel with a live distributed system. 
The model checker is then periodically restarted from the 
taken snapshot. It is worth noting that LMC improves the 
performance of model checking anyway, independent of 
CrystalBall. For testing of complex programs, however, 
we use the online model checking approach to restart the 
model checker before exponential explosion manifests. 

We changed MaceMC to work only on one global
object of the network simulator, i.e., $I^+$. To change the
network handler implementations from $H_M$ to $H'_M$
(Figure 8), we changed the network simulator not to remove a
message after its delivery. MaceMC automatically
generates specific functions for (de)serializing a module state
in the service. We added specific functions to save and 
restore the whole service stack. This is required for 
multi-layer services such as 1Paxos [15] (one of the pro- 
tocols we check), which uses Paxos as its lower layer 
module. To efficiently check for duplicate states, we use 
the hashes of the serialized states. For each node n, the 
hashes of the traversed states are kept in a Set structure. 
The serialized state itself is stored in a deque structure 
to benefit from its efficiency in random access. 

Each message keeps track of the number of node 
states on which it has been executed. Therefore, in each 
round, each message is checked only on the newly added 
states, by jumping over the old states. Instead of the ac- 
tual event, its hash is added into the predecessor point- 
ers. These hash values will be checked against the hash 
values of the enabled events, later when we verify the 
soundness of the system state. 
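
The bookkeeping described in the last two paragraphs can be sketched as follows. This is a Python stand-in, not the C++ prototype: JSON with sorted keys plays the role of Mace's generated serializer, SHA-256 is an arbitrary choice of hash, a Python list replaces the deque, and the class names are hypothetical.

```python
import hashlib
import json

class NodeStore:
    """Visited states of one node: hashes for duplicate checks, list for indexing."""

    def __init__(self):
        self.hashes = set()
        self.serialized = []          # the prototype uses a deque here

    def add(self, state):
        """Store `state` unless an identical serialized state was already seen."""
        blob = json.dumps(state, sort_keys=True).encode()
        digest = hashlib.sha256(blob).digest()
        if digest in self.hashes:
            return False
        self.hashes.add(digest)
        self.serialized.append(blob)
        return True

class MessageCursor:
    """Remembers on how many states of its destination a message has been run."""

    def __init__(self, message):
        self.message = message
        self.done = 0

    def unseen_states(self, store):
        """Return only the states added since the previous call."""
        fresh = [json.loads(b) for b in store.serialized[self.done:]]
        self.done = len(store.serialized)
        return fresh
```

The cursor relies on states being appended in visiting order, so the slice after `done` contains exactly the states on which the message has not yet been executed.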

Test driver. The test in model checking a service is 
generally driven by an application sending requests to 
the service. In Paxos for example, an application send- 
ing propose requests to the service is the test driver of 
the model checker. The more complex the test driver, the 
larger the generated state space is. A careful design of the 
test driver could greatly impact the efficiency of model 
checking. In our Paxos experiments, the test driver pro- 
poses values for a particular index. The index is selected 
from recent chosen proposals, where not all the nodes 
have learned the proposal yet. Otherwise, a new index is 
used for the proposal. 

System states. To avoid revisiting system states, check- 
ing invariants on system states is performed only after 
visiting a new node state, which implies the possibility
for creating new system states. For each new node state 
(n,s), the system states are created by iterating over the 
states of all the nodes except node n and loading them. 
This is because the combinations of the previously vis- 
ited states of node n and the node states of the other 
nodes have already been verified in previous rounds. It is 
worth noting that this optimization could make the model 
checking incomplete because the handler execution that 
has not produced a new node state could still change the 
pointers in predecessors, which means the possibility of 
a valid event sequence for a previously rejected system 
state. To address this issue we could cache the system 
states in which an invariant is violated and reverify them 
after the changes into LS’ that affect them. 
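
A minimal sketch of this system state creation step, assuming node states are kept in a dictionary from node id to the list of visited states (the function name and data layout are assumptions made for illustration):

```python
from itertools import product


def new_system_states(node_states, n, s):
    """Yield every system state that contains the new state s of node n.

    node_states maps a node id to the list of states visited on that node.
    Older states of node n are skipped on purpose: their combinations were
    already checked in previous rounds.
    """
    others = sorted(k for k in node_states if k != n)
    for combo in product(*(node_states[k] for k in others)):
        state = dict(zip(others, combo))
        state[n] = s
        yield state
```

For three nodes with two, one, and two visited states, a new state on the second node yields 2 x 2 = 4 system states to check.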


Besides the general approach for system state creation,
we also implemented an invariant-specific variation,
denoted LMC-OPT, optimized for the Paxos main invariant.
In this variation, we map the node states to the values
that are chosen in them. Because most of the node states
have not chosen any value, many of them will not be
included in this mapping. When creating system states, we
thus consider only combinations in which at least two node
states are mapped to different values. This optimization
avoids the creation of many redundant system states and
consequently omits their corresponding invariant checking
and soundness verification steps.
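
The filtering idea behind LMC-OPT can be illustrated with a short sketch. The mapping layout and the function name conflicting_pairs are assumptions; the sketch only finds the pairs of node states that disagree, which is the precondition for building a system state at all.

```python
from itertools import combinations


def conflicting_pairs(chosen):
    """Return pairs of node states on different nodes that chose different values.

    chosen maps (node id, state id) -> chosen value and contains only the
    node states in which some value has been chosen (assumed layout).
    """
    pairs = []
    for (a, va), (b, vb) in combinations(sorted(chosen.items()), 2):
        if a[0] != b[0] and va != vb:
            pairs.append((a, b))
    return pairs
```

When no two node states disagree, the result is empty and no system state needs to be created, which matches the bug-free case reported in the evaluation.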

Soundness verification. Procedure isStateSound
uses the pointers in variable predecessor to find event
sequences that could lead to the input node states. For the
sake of simplicity in implementation, we ignore the
self-references when following the pointers in predecessor.
Although in theory this could make the exploration
incomplete, in practice the search in the limited time budget
is incomplete anyway, and benefiting from the simplicity is,
hence, preferable. Moreover, after the soundness
verification on a system state is finished, some more pointers
could be added into predecessor by the process of local
model checking. Therefore, a complete exploration
should invoke soundness verification after each change
to predecessor. However, an efficient implementation
of that would be complex since it should check only
the newly added pointers. For the sake of simplicity
in implementation, we invoke soundness verification
only after a new node state is visited.

Procedure isSequenceValid. The validity of a set
of sequenced events could in general be checked by
executing them in a simulator (the same way the global
model checking approach transitions from one global
state to another). If no event from the sequences is
enabled in the simulator, the sequence of events
is not valid. Although using the simulator simplifies the
implementation, initializing the simulator at each run of
the soundness verification is expensive since it involves
loading the test driver.

For an efficient implementation of the soundness verification
module, we take advantage of the following observation.
The role of the simulator in executing event e on node n
is to (i) update the state of node n, (ii) remove the
message m from the network if e is a network event for the
delivery of message m, and (iii) add the set c of messages,
resulting from the execution of e, to the network.

The message consumed by a network event is specified
by its corresponding hash in the node event sequence,
which was given as a part of the input to the procedure.
The set of messages generated by an event execution
can also be remembered by keeping the hashes of the
generated messages in predecessor. In this manner, the
input to Procedure isSequenceValid is the set of
sequenced events as well as the set of generated messages
by each event. The execution of event e in Procedure
isSequenceValid can then be simplified as follows:


1. A local event e is always enabled. A network event 
e is enabled if the hash of the required message is 
found in the set of generated message hashes, net. 

2. If event e is enabled, then pop it out from the se- 
quence. If event e is a network event, remove the 
hash of the corresponding message from set net. 

3. After popping out event e, add its generated mes- 
sage hashes to set net. 


The above implementation simplifies Procedure
isStateSound to some integer comparison operations
and therefore makes checking the validity of a set
of sequenced events very efficient.
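
The three steps can be sketched as a hash-only replay. The event encoding (a pair of the consumed message hash, or None for a local event, and the list of generated message hashes) and the function name are assumptions made for illustration. The sketch is greedy and therefore assumes that each message hash is consumed by at most one node's sequence.

```python
from collections import Counter, deque


def is_sequence_valid(sequences):
    """Replay per-node event sequences using message hashes only.

    sequences maps a node id to a list of events; each event is a pair
    (consumed_hash_or_None, generated_hashes). None marks a local event.
    """
    pending = {n: deque(events) for n, events in sequences.items()}
    net = Counter()  # multiset of hashes of generated, not yet consumed messages
    progress = True
    while progress:
        progress = False
        for queue in pending.values():
            while queue:
                consumed, generated = queue[0]
                if consumed is not None and net[consumed] == 0:
                    break  # step 1: the head network event is not enabled
                queue.popleft()  # step 2: pop the enabled event
                if consumed is not None:
                    net[consumed] -= 1  # step 2: remove the consumed hash
                net.update(generated)  # step 3: add the generated hashes
                progress = True
    return all(not queue for queue in pending.values())
```

The loop stops when a full pass over the nodes pops nothing; the sequences are valid only if every queue has been emptied by then.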

Local assertions. LMC checks for the system invariants
defined on the system state. The source code could
still be instrumented with local assertions from which
the developers have benefited in earlier stages of testing.
The violation of a local assert statement in the process
of local model checking could imply that either (i)
the node state is invalid, perhaps because of delivering an
unexpected message, or (ii) there is a bug in the system
under test. Checking the latter case necessitates (i) creating
all the system states by combining the node state
with all states from other nodes, and (ii) checking the
validity of those states by invoking soundness verification.
This approach is very expensive since it involves many
invocations of soundness verification.

In general we could ignore the violation of a local assert
since a protocol bug will eventually manifest itself by
violating a system invariant. Alternatively, we can discard
the node state on which the assertion is violated, assuming
that the assert violation implies the invalidity of the
node state. In the applications we tested, the assert
statements were mostly used to exclude the receipt of
unexpected messages, i.e., the case that could be caused by
the conservative message delivery policy of LMC, which
delivers the message to all the node states of the destination.
We, therefore, benefited from the local assert violations
by discarding the corresponding node states.

Local events. The presented algorithm in § 4 is com- 
plete in the sense that, given enough time and space, it 
explores all possible states. In practice, however, we 
have a short time budget to check the reachable states 
from a given current state. Therefore, the developers 
might be interested to favor some events to be explored 
first in the search. Hence, in each round we put a bound 
on the number of local events that each node can exe- 
cute; after finishing the round, the bounds are increased 
and the model checking is started from scratch. This ap- 
proach is in spirit similar to B-DFS search, where the 
search depth is increased at each step. 
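
A minimal sketch of this restart loop, assuming a hypothetical callback run_round(bound) that performs one complete model checking run with at most bound local events per node and returns a violation or None:

```python
def search_with_increasing_bound(run_round, max_bound):
    """Restart the search from scratch with a growing local-event bound.

    run_round(bound) is an assumed callback: it explores with at most
    `bound` local events per node and returns a violation, or None.
    """
    for bound in range(1, max_bound + 1):
        violation = run_round(bound)  # each round starts from scratch
        if violation is not None:
            return bound, violation
    return None
```

Small bounds are cheap, so the events the developers favor are explored first, and the repeated work of the restarts is amortized by the much larger cost of the later rounds.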

Duplicate messages. In general, a node could infinitely 
issue duplicates of the same message. For example, in 
the verified Paxos implementations, the same Chosen
message will be sent over and over to the proposer that
insists on an already chosen value. To favor the main
protocol messages in the limited time of search, we have 
put a limit on the number of duplicate messages sent 
from a source to a destination node. This limit is set 
to zero for the results reported in this paper. Note that 
the duplicate messages can be postponed to be processed 
later, after processing some main protocol messages. 

As we explained, to ensure completeness, the messages
are never erased from the network object, I~.
However, if node state s -m-> s’, where m is a network
event, execution of m on s’ is redundant since m is
already executed in the sequence. To avoid redundant
executions, we keep the history of the messages that have
been executed to obtain the state: a network event is
considered on a state only if it is not in the history of the
state. After executing message m on node state s that
results in node state s’, we apply the following two rules
to maintain the history: (i) s’.history = s.history, (ii)
s’.history.addLast(m). Thus, message m will never
be executed on node state s’ or its descendants.
Maintaining history gets complicated if state s’ already
exists since we need to maintain separate histories for
different sequences that lead to s’. We have simplified
the implementation by applying rule (i) only if the state
does not exist. Since the run of LMC in the limited time
budget is not complete anyway, we decided to favor
simplicity over completeness here.
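
The two history rules can be sketched as follows. Histories are modeled as plain lists of message identifiers, and both function names are hypothetical:

```python
def successor_history(history, m):
    """Build the history of s' after executing message m on state s."""
    new_history = list(history)  # rule (i): s'.history = s.history
    new_history.append(m)        # rule (ii): s'.history.addLast(m)
    return new_history


def candidate_messages(network, history):
    """Messages of the shared network not yet executed on the way to this state."""
    seen = set(history)
    return [m for m in network if m not in seen]
```

Copying the list keeps the history of s untouched, so m remains a candidate for s itself and for its other successors.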


4.3 Scope of Applicability


In contrast with global model checking, in which the validity
of each traversed state is ensured, local model checking
optimistically allows visiting invalid states and verifies the
validity of a state only after it violates an invariant. If
we have only a few preliminary violations, the optimistic
approach of local model checking performs better since it
does not pay for ensuring the validity of every single visited
state. Otherwise, the cost of soundness verification
dominates. For example, in online model checking, if a run
of the model checker is revealing a bug in the protocol,
it is likely to see lots of violation reports caused by both
valid and invalid event sequences. One possible solution
is to run both the local and the global model checker in
parallel and use the result of the one that finishes sooner.

By eliminating the network element from the model 
checking state, local model checking reduces the ex- 
plored state space since each system state is repeated in 
multiple global states that are different only in the net- 
work part. The larger the network state space is, the more 
space and time is saved by eliminating it. Local model 
checking is, therefore, most effective for the protocols 
that are chatty, i.e., exchange lots of messages to service
a request. Otherwise, if the nodes rarely communicate, 
the change into the network is rare and therefore there is 
not much to be saved by local model checking. 

In contrast with global model checking, local model 
checking considers interleaving of parallel network 
events only when they turn out to be dependent. LMC, 
therefore, avoids lots of unnecessary event interleaving. 
For example, upon receipt of the Accept message, the 
nodes in Paxos broadcast some Learn messages in par- 
allel, which enables LMC to perform much better than 
global model checking. The more parallel network ac- 
tivities in the system, the more effective LMC is. For ex- 
ample, we could not expect much from LMC in a chain 
system in which each node simply forwards the input 
message to the next. 

The current implementation of LMC assumes a best-effort,
lossy network, i.e., IP. The protocols that use UDP
can, therefore, be directly model checked with LMC.
Although TCP could be considered as part of the protocol
stack, in practice this is not efficient, and TCP is
usually simulated in the model checker. To do so, the LMC
implementation should also be augmented to benefit from
the fact that reordered messages in a connection will
eventually be rejected by TCP and could, hence, be
ignored, saving some unnecessary handler executions in
the model checker.


5 Evaluation 


We evaluate in this section the performance of our local 
model checking approach compared to a classic global 
one. We also illustrate the ability of our tool, LMC, in 
finding bugs in Paxos and its variant, 1Paxos. 

Figure 10: The elapsed time in model checking Paxos
where only one out of three nodes proposes a value.

We use Paxos as a complex distributed testbed to
evaluate the performance of the proposed local model
checking approach. In usual implementations of Paxos, each
node implements three roles: proposer, acceptor, and
learner. Multiple proposers can concurrently propose
values for the same index. The Paxos invariant (also 
known as the Paxos safety property) stipulates that no 
two nodes will choose different values for the same index.
A proposition (i.e., proposing a value for an index)
starts by broadcasting Prepare messages to the accep- 
tors. The acceptors respond by a PrepareResponse mes- 
sage. After receiving it from a majority of acceptors, the 
proposer broadcasts an Accept message to the acceptors. 
The value in the Accept message is the value returned 
by the PrepareResponse message with the highest pro- 
posal number, which reflects the accepted values from 
previous proposals, if there is any. Each acceptor then 
broadcasts a Learn message to the learners. A value is 
chosen by the learners after receiving the Learn message 
from a majority of acceptors. 
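
As an illustration, the Paxos invariant can be checked on a system state with a few lines of code. The representation of a system state as a dictionary from node id to the values that node has chosen per index is an assumption of this sketch, as is the function name.

```python
def violates_paxos_invariant(system_state):
    """Return True if two nodes chose different values for the same index.

    system_state maps node id -> {index: chosen value} (assumed layout).
    """
    chosen = {}
    for by_index in system_state.values():
        for index, value in by_index.items():
            # The first value seen for an index is the reference value.
            if chosen.setdefault(index, value) != value:
                return True
    return False
```

Nodes that have not chosen anything for an index simply contribute no entry, so they can never cause a violation.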

For benchmarking purposes, we use a state space of
Paxos running between three nodes, in which one node
proposes a value once and the others react to this
proposal by communicating using Paxos messages. The
long chain of messages following each proposal could
be received in a variety of orders, which all must be
considered by a model checker. For each experiment, we
report on the evaluation of three algorithms: (i) B-DFS
(explained in § 3), (ii) LMC-GEN, which is the non-optimized,
general version of our local model checker (LMC), and
(iii) LMC-OPT, which is a version of our local model
checker optimized for the Paxos main invariant according
to § 4.2. The experiments are run on a 3.00 GHz
Intel(R) Pentium(R) 4 CPU with 1 MB of L2 cache.


5.1 LMC Speedup 


Here we evaluate the speedup in model checking that
we can get by our tool, LMC. Figure 10 presents the
results for the example state space, in which only one
node proposes a value. This state space is relatively
small and yet effective in finding bugs when it is
explored through an online model checker. The depth of
the state space is 22 events (three initialization, one
propose local event, three Prepare messages, three
PrepareResponse messages, three Accept messages, and nine
Learn messages). LMC also explores longer sequences
of events (up to 25) since it could also explore some
invalid sequences of events.* The elapsed time is depicted
on a logarithmic scale to illustrate the exponential state
space explosion problem. In B-DFS, the exponential explosion
starts from the very early steps, which makes the
exploration take 1514 s. The growth in LMC-OPT is much
less steep, which allows it to finish the model checking
in just 189 ms (~8,000 times faster than B-DFS).

Figure 11: The number of explored states. The number
of system states explored by LMC-OPT is zero and is,
hence, not plotted in the figure.

The growth in LMC-GEN, although still much more 
gentle than B-DFS, is steeper than LMC-OPT. The ex- 
ploration finishes in 5.16 s which is still ~300 times 
faster than B-DFS. The extra delay is due to the cre- 
ation of the system states out of the explored node states, 
which in LMC-OPT is optimized to be performed only 
after a different value is chosen. Figure 11 depicts the 
number of explored states. The number of created system 
states in LMC-GEN, although much less than B-DFS, is 
much more than the total number of node states, denoted 
LMC-local in the figure. LMC-OPT, on the other hand, 
drops the number of created system states to zero since 
there is no bug in the Paxos implementation to lead to 
any preliminary violations. (LMC-OPT creates a system 
state only if it is likely to invalidate the invariants.) 

The total number of performed transitions in B-DFS
is 157,332. LMC drops this to 1,186, which is ~132
times fewer. This is because an LMC transition from state
s to state s’ in node n is redundantly executed several
times in the global model checking approach (once for each
global state that encompasses s and its network event is
enabled).
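
The ratios quoted in this section follow directly from the reported measurements, as the short computation below shows (the variable names are ours; the numbers are the ones given in the text):

```python
# Elapsed times reported for the one-proposal state space, in seconds.
bdfs_time_s = 1514.0
lmc_opt_time_s = 0.189
lmc_gen_time_s = 5.16

# Transitions performed by B-DFS and by LMC on the same state space.
bdfs_transitions = 157_332
lmc_transitions = 1_186

speedup_opt = bdfs_time_s / lmc_opt_time_s             # roughly 8,000
speedup_gen = bdfs_time_s / lmc_gen_time_s             # roughly 300
transition_ratio = bdfs_transitions / lmc_transitions  # roughly 132

print(f"LMC-OPT speedup: {speedup_opt:.0f}x")
print(f"LMC-GEN speedup: {speedup_gen:.0f}x")
print(f"Transition ratio: {transition_ratio:.1f}x")
```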


*The invalid sequences will eventually be rejected by the soundness
verification phase if they violate some invariants.

This state space of Paxos is very useful in online
model checking, where we expect the model checker to
search for a bug within a time budget of less than a minute.
Both LMC-OPT and LMC-GEN can finish this state
space in this duration, and LMC-OPT can continue with
more complicated state spaces when there is some time
left (as we explained in § 4.2, the model checker, in favor
of time, starts with small state spaces and gradually
increases the number of allowed local events). This is in
contrast to B-DFS, which will not go further than depth 12
within a minute.


5.2 LMC Scalability Limits 


We showed that LMC manages to finish a valuable state 
space in less than a few seconds. This is already good 
enough for practical applications such as online model 
checking that restarts the model checker every few sec- 
onds. From the theoretical point of view at least, it is 
interesting to find the scalability limits of LMC, i.e., the
point where the postponed exponential explosion prob- 
lem eventually manifests and makes LMC ineffective for 
the rest of the exploration. To this aim, we choose a much 
bigger state space, where two separate nodes propose 
two values. The depth of the state space is 41 events, 
which is two times the events in one error-free proposal. 
(LMC explores also longer sequences of events, up to 68, 
since it could also explore invalid sequences of events.) 

Due to the exponential explosion problem, neither B-DFS
nor LMC could finish the state space, even after hours
of running. Within this duration, B-DFS explores up to 20
steps (out of a maximum depth of 41) and LMC searches
up to 39 steps (out of a maximum depth of 68). The major
contributor to the slowdown of LMC is the expensive task
of soundness verification. The number of different event
sequences that must be considered for checking the validity
of a system state increases exponentially with the search
depth. In the above example, where the search depth of
LMC is 39, each invocation of soundness verification
adds ~10 s to the running time. Invocations of soundness
verification are much fewer in the smaller state space
in which only one node proposes a value.


5.3 LMC Memory Requirements


Figure 11 shows that the number of node states
explored by LMC is much less than the total number
of system or global states. Because LMC keeps track
only of node states, and the system states are created
only temporarily, LMC is expected to have a very low
memory footprint. Figure 12 verifies this expectation by
depicting the memory footprints of different algorithms.
LMC-local denotes the run of LMC-OPT in which the
creation of system states is disabled. The difference
between LMC-local and LMC-OPT (resp. LMC-GEN)
indicates the memory overhead of system state creation
as well as soundness verification. Although there is a
marginal overhead for system states, the memory
eventually returns to the system by reusing the deleted objects.
The additional memory consumed by all algorithms is
less than 1 MB, which can totally fit into the L2 cache.
However, the exponential trend in the memory consumption
of B-DFS indicates that B-DFS will be ineffective for
deeper searches. LMC in contrast uses the memory very
efficiently (~200 KB in total) and this amount grows
linearly with the search depth.

Figure 12: The consumed memory. The numbers for all
configurations of LMC are close together and are, hence,
overlapped in the figure.

Figure 13: The overheads of LMC in model checking
Paxos in which a bug is injected.


5.4 LMC Overheads 


Here we break down the overheads that limit the
scalability of LMC. LMC has two major overheads: (1) creation
of system states out of traversed node states, and (2)
verifying the soundness of the preliminary violations. The
precise load of each overhead depends on the particular
system under test. Figure 13 illustrates the overheads
of LMC-OPT in the buggy implementation of Paxos,
for which the corresponding bug is reported in § 5.5.
In LMC-system-state the soundness verification phase
is disabled, and in LMC-explore the creation of system
states is eliminated.

The difference between LMC-system-state and LMC-explore
captures the overhead of creating the system
states and checking the invariant on them. The overhead
is zero until 21 steps since the unnecessary system states
are bypassed by the optimization in LMC-OPT. Afterwards,
the overhead increases with the search depth,
because as the exploration moves forward, more node states
are explored and hence more combinations of them must
be considered for system state creation. The difference
between LMC-OPT and LMC-system-state reveals the
overhead of soundness verification. (LMC-OPT did not
go further than 28 steps, the level at which the injected
bug is rediscovered.) This overhead is the major
contributor to the exponential increase in model checking time.
The reason is that not all combinations of node states are
valid, and the more node states are traversed, the more
invalid system states will be checked. On the other hand,
since the injected bug is close to manifesting in this run of
the model checker, the number of invalid combinations
of node states that violate the invariant increases.
LMC-OPT triggers the soundness verification 773 times,
and each call takes 45 ms on average. Overall, 427,731
different event sequences were checked by the soundness
verification module.


5.5 Testing Paxos 


In this section, we report on our experiments in inject- 
ing a bug into a Paxos implementation and then running 
our prototype to verify its ability to detect the bug. The 
bug we injected was reported in a previous implementa- 
tion of Paxos [10]: once the leader receives the Prepar- 
eResponse message from a majority of nodes, it creates 
the Accept request by using the submitted value from 
the last PrepareResponse message instead of the Prepar- 
eResponse message with highest round number. The in- 
stalled invariant is the original Paxos invariant: no two 
nodes can choose different values. 
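
The difference between the correct rule and the injected bug can be sketched as follows. The encoding of a PrepareResponse as a pair (accepted round, accepted value), with None for an acceptor that has not accepted anything, is an assumption of this sketch, as are the function names.

```python
def pick_value_correct(responses, own_value):
    """Use the accepted value carried by the response with the highest round."""
    accepted = [(rnd, val) for rnd, val in responses if rnd is not None]
    if not accepted:
        return own_value  # nothing accepted yet: free to propose own value
    return max(accepted, key=lambda rv: rv[0])[1]


def pick_value_buggy(responses, own_value):
    """Injected bug: only the last received response is consulted."""
    rnd, val = responses[-1]
    return val if rnd is not None else own_value
```

If an earlier response reports that v1 was accepted in round 1 and the last response comes from an acceptor that has accepted nothing, the correct rule keeps v1 while the buggy rule lets the proposer push its own value.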

Every minute, the online model checking framework
takes the live system state of a running Paxos
application and uses that to initialize the next run of LMC. The
application encompasses three nodes; each node
proposes its Id for a new index and then sleeps for a random
time between 0 and 60 s. The nodes communicate using
UDP and 30% of non-loopback messages are randomly
dropped to allow rare states to be also created.

The bug was detected after 1150 seconds. The run of
LMC that detected the bug was initialized with the
following live state: for index k1, node N1 has proposed
value v1, nodes N1 and N2 have accepted this proposal,
but due to message losses only N1 has learned it. Starting
from this system state, LMC detected in 11 s a
violation of the Paxos invariant in the following scenario:
N2 proposes a new value v2 but its Prepare message is
not received by N1. N2 responds with a PrepareResponse
message containing value v1, because this value was
accepted by N2 in the previous round. However, N3, since
it had not accepted any value for index k1, responds
with the same value proposed by N2, v2. Receipt of the
PrepareResponse of N3 triggers the bug, and N2 broadcasts
an Accept message for v2 instead of v1. Eventually this
leads to choosing value v2 in N2, which is different from
the value chosen by N1, i.e., v1.


5.6 Testing 1Paxos 


In this section, we report on running our prototype to
find bugs in a variant of Paxos, denoted 1Paxos [15]:
this is an efficient variation of Multi-Paxos [2] that uses
only one acceptor. Upon failure, the active acceptor is
replaced with a backup acceptor by the global leader.
Therefore, it is necessary that the acceptor and leader
roles be assigned to two separate nodes. To uniquely
identify the global leader and the active acceptor, 1Paxos
uses a separate consensus protocol referred to as
PaxosUtility [15]. The global leader and the active acceptor are
identified by the last LeaderChange and AcceptorChange
entries in the PaxosUtility, respectively. In this
experiment, we have implemented PaxosUtility using Paxos
itself. 1Paxos is more complex than Paxos since it comprises
more logic. Here we use the same setup that was used for
testing Paxos, with the difference that the application,
instead of proposing a value, triggers the fault detector with
a probability of 0.1 to stress the fault tolerance
mechanisms of 1Paxos. In 225 s, the tool found one new bug in
1Paxos that we report in the following.

The bug was created because of the wrong usage of the
“++” operator; if the operator is used after the operand,
the returned value is the original value and not the
increased one. The developer had made this mistake in
the initialization function, where the leader is set to
the first node of the members and the acceptor is set
to the second. The used command was acceptor =
*(members.begin()++), which makes the acceptor
be the same node as the leader. The bug is of course
fixed by putting the “++” operator before the operand,
i.e., acceptor = *(++members.begin()).

During the live run, node N3 attempts to be the leader
by inserting a LeaderChange entry into the PaxosUtility.
At this moment, it obtains from the PaxosUtility the
correct value of the active acceptor, which is N2. After N3
becomes leader, it proposes value v3 for index k1, which
is accepted by the acceptor, i.e., N2. N2 then broadcasts
a Learn message, which is received by N3 as well as
itself. At this point the live system state, in which all nodes
except N1 have chosen value v3 for the index k1, is taken
to be used by LMC.

Starting from the above system state, LMC highlights
the following scenario that violates the Paxos invariant:
N1, which still assumes it is the leader, proposes value
v1 for index k1 to the acceptor. Since N1 considers itself
to be the leader, according to the protocol, it does not
refer to PaxosUtility to get the acceptor Id. Therefore, N1
uses its current value, which is set to N1, i.e., its own Id,
due to the initialization bug described above. N1 accepts
the proposal and sends a Learn message to N1. Upon
receiving the loopback message, N1 assumes value v1 as
chosen for index k1. This violates the Paxos invariant
since other nodes have chosen a different value, i.e., v3.


6 Related Work 


Cartesian abstraction. This is an abstraction-based 
verification technique where an overapproximated vari- 
ant of the program is model checked, instead of the origi- 
nal one [1]. Due to overapproximation, the reported bugs 
are not sound, which makes the technique mainly useful 
for correctness proving, benefiting from the completeness
of the search. Malkis et al. [11] achieved thread-modular
model checking [5, 12] using a Cartesian abstract
interpretation of multi-threaded programs. Each
thread state consists of the thread local variables plus the 
global variables. For each thread, the model checker sep- 
arately explores possible valuations of the thread local 
variables as well as the global variables. The approxima- 
tion comes from the fact that the valuations of the global 
variables by a thread are also used by other threads, ig- 
noring the causal order for obtaining them. Again, the 
unsoundness, stemmed from the approximation, makes 
the technique inappropriate for testing purposes. In con- 
trast, our reported bugs are sound and this is ensured 
by keeping track of the events executed for obtaining a 
node state and checking the validity of the combination 
of these histories after a preliminary invariant violation 
report. 

We also make use of the Cartesian product of indepen- 
dently explored node states to obtain the system states. 
Cartesian abstraction is essential here in our approach in 
order to create the system states and check (system-wide) 
invariants against them. In contrast, previous works ben- 
efited from the Cartesian abstraction by not creating sys- 
tem states; skipping the system states is possible since 
the invariants in multi-threaded programs are just thread- 
local assert statements and could be verified on a local 
state of a thread without having the rest of the system 
state.> Our local model checking approach employs the
Cartesian abstraction in a different way: namely, to
explore the system state space without exploring the global
state space.

>There is ongoing research to convert a system-wide invariant to
a set of thread-local assert statements, which has shown good results
on small multi-threaded programs [3].

In [6], Cartesian abstraction is used on top of boolean
abstraction of threads to find race conditions in
multi-threaded programs. After boolean abstraction, each
thread is represented by a long boolean expression over
global and local variables including an artificially added
variable for line number. A race condition is also
represented by a boolean expression over the line numbers
in which the threads read and write the global variables.
Race conditions are detected by taking the conjunction of the
thread boolean expressions with race conditions. Therefore,
there is no need for system state creation. This
approach cannot be applied to general system invariants
that would express a relation between local variables of
multiple threads. The approach applies a heuristic on the
detected races to eliminate some of the false positives.

One could indeed generalize the Cartesian abstract
interpretation presented in [11] to distributed systems, by
using the network as the global object. However, the
network would still be part of the model checking states,
concatenated to the local states. In our approach, we
exclude the network element from the model checking state
and use only a shared network element.

Monotonic abstraction. Monotonic abstraction [13]
of the network has been used in verification of security
protocols since it accounts for the maximal knowledge
learned by the attacker. Dolev-Yao’s model [4] is one such
model, in which the attacker remembers all messages
that have been intercepted or overheard. The shared
network object in our local model checking approach is
essentially an application of a monotonic abstraction since
the delivered messages are not removed from the
network. The shared monotonic network is key to ensuring
the completeness of the search by applying the generated
messages also on future generated node states.

Online model checking. CrystalBall [18, 17] is a
framework that implements the online model checking
scheme. To be effective in practice, the online model
checker must be fast enough to explore to a reasonable
depth in the period between two restarts (typically a few
seconds). CrystalBall uses a heuristic, namely
Consequence Prediction, which prunes the local events of an
already visited node state. As a heuristic, Consequence
Prediction is incomplete and could, hence, miss some
bugs due to false negatives. In contrast, our local model
checking approach offers a complete search
accompanied with proofs. Furthermore, complex distributed
systems such as Paxos often generate lots of network
messages on which Consequence Prediction does not have
any effect. For instance, in the Paxos state spaces used
throughout this paper, we consider only the interleaving
of the resulting network messages after some proposals.
Therefore, Consequence Prediction, which does
not prune the network messages, would not offer any
improvement over B-DFS.


7 Concluding Remarks 


We introduce a novel, local approach to model check- 
ing distributed systems. Essentially, the underlying idea 
is to remove the network state from the global state 
when model checking, and focus on the remaining sys- 
tem state, which is the usual required part for invariant 
checking. The system state is itself built temporarily out 
of node states, and these are maintained separately. Al- 
though complete, the approach is not sound in the sense 
that some system states could be invalid, i.e., could not 
have been produced by an actual run of the system. We 
check the soundness of the system state, a posteriori, 
only if an invariant is violated. 

By removing the network from the global states, our
local model checking approach creates far fewer system
states than the global approach. In addition,
and in contrast with the latter approach, in which visiting
the system states is an inherent part of the exploration
process, the local approach separates the exploration
of transitions from the actual creation of system states.
This makes it possible to exploit the specificities of the 
user-specified invariants and a priori eliminate all system 
states on which these invariants cannot be violated. 

Clearly, the exponential state explosion problem is
not eliminated in our approach, and it indeed eventually
manifests, especially because of invalid system states.
Yet the problem is postponed and this makes our local 
approach an adequate match for online model checking 
that restarts the model checker periodically. Using on- 
line model checking augmented with our local approach, 
we found a previously reported bug in a traditional Paxos 
implementation, as well as a new bug in a recent variant 
of Paxos. Both bugs have been identified by focusing on 
a simple, arguably common case, namely the case with 
no contention for which distributed protocols are typi- 
cally optimized and hence error-prone. 

For future work, one can think of methods to automatically
prune the system states according to a given
invariant. In addition, the low memory consumption of
our approach opens the door to techniques that trade
memory for CPU to gain further speedup.


Abstract


As the cloud era begins and failures become com- 
monplace, failure recovery becomes a critical factor in 
the availability, reliability and performance of cloud ser- 
vices. Unfortunately, recovery problems still take place, 
causing downtimes, data loss, and many other problems. 
We propose a new testing framework for cloud recovery: 
FATE (Failure Testing Service) and DESTINI (Declara- 
tive Testing Specifications). With FATE, recovery is sys- 
tematically tested in the face of multiple failures. With 
DESTINI, correct recovery is specified clearly, concisely, 
and precisely. We have integrated our framework into
several cloud systems (e.g., HDFS [33]), explored over
40,000 failure scenarios, written 74 specifications, found
16 new bugs, and reproduced 51 old bugs.


1 Introduction 


Large-scale computing and data storage systems, includ- 
ing clusters within Google [9], Amazon EC2 [1], and 
elsewhere, are becoming a dominant platform for an 
increasing variety of applications and services. These 
“cloud” systems comprise thousands of commodity ma- 
chines (to take advantage of economies of scale [9, 16]) 
and thus require sophisticated and often complex dis- 
tributed software to mask the (perhaps increasingly) 
poor reliability of commodity PCs, disks, and memo- 
ries [4, 9, 17, 18]. 

A critical factor in the availability, reliability, and per- 
formance of cloud services is thus how they react to fail- 
ure. Unfortunately, failure recovery has proven to be 
challenging in these systems. For example, in 2009, 
a large telecommunications provider reported a serious 
data-loss incident [27], and a similar incident occurred 
within a popular social-networking site [29]. Bug repos- 
itories of open-source cloud software hint at similar re- 
covery problems [2]. 

Practitioners continue to bemoan their inability to ad- 
equately address these recovery problems. For exam- 
ple, engineers at Google consider the current state of 
recovery testing to be behind the times [6], while oth- 
ers believe that large-scale recovery remains underspec- 
ified [4]. These deficiencies leave us with an important
question: How can we test the correctness of cloud systems
in how they deal with the wide variety of possible
failure modes?

To address this question, we present two advance- 
ments in the current state-of-the-art of testing. First, we 
introduce FATE (Failure Testing Service). Unlike exist- 
ing frameworks where multiple failures are only exer- 
cised randomly [6, 35, 38], FATE is designed to systemat- 
ically push cloud systems into many possible failure sce- 
narios. FATE achieves this by employing failure IDs as a 
new abstraction for exploring failures. Using failure IDs, 
FATE has exercised over 40,000 unique failure scenarios
and uncovered a new challenge: the exponential explosion
of multiple failures. To the best of our knowledge, we
are the first to address this in a more systematic way than
random approaches. We do so by introducing novel prioritization
strategies that explore non-similar failure scenarios
first. This approach allows developers to explore
distinct recovery behaviors an order of magnitude faster
than a brute-force approach.

Second, we introduce DESTINI (Declarative Testing 
Specifications), which addresses the second half of the 
challenge in recovery testing: specification of expected 
behavior, to support proper testing of the recovery code 
that is exercised by FATE. With existing approaches, 
specifications are cumbersome and difficult to write, and 
thus present a barrier to usage in practice [15, 24, 25, 32, 
39]. To address this, DESTINI employs a relational logic 
language that enables developers to write clear, concise, 
and precise recovery specifications; we have written 74 
checks, each of which is typically about 5 lines of code. 
In addition, we present several design patterns to help de- 
velopers specify recovery. For example, developers can 
easily capture facts and build expectations, write spec- 
ifications from different views (e.g., global, client, data 
servers) and thus catch bugs closer to the source, express 
different types of violations (e.g., data-loss, availability), 
and incorporate different types of failures (e.g., crashes, 
network partitions). 

The rest of the paper is organized as follows. First, 
we dissect recovery problems in more detail (§2). Next,
we define our concrete goals (§3), and present the design
and implementation of FATE (§4) and DESTINI (§5). We
then close with evaluations (§6) and conclusion (§7).


2 Extended Motivation: Recovery Problems


This section presents a study of recovery problems 
through three different lenses. First, we recap accounts 
of issues that cloud practitioners have shared in the lit- 
erature (§2.1). Since these stories do not reflect details,
we study bug/issue reports of modern open-source cloud
systems (§2.2). Finally, to get more insights, we dissect
a failure recovery protocol (§2.3). We close this section
by reviewing the state-of-the-art of testing (§2.4).


2.1 Lens #1: Practitioners’ Experiences 


As well-known practitioners and academics have stated: 
“the future is a world of failures everywhere” [11]; “re- 
liability has to come from the software” [9]; “recovery 
must be a first-class operation” [8]. These are but a 
glimpse of the urgent need for failure recovery as we en- 
ter the cloud era. Yet, practitioners still observe recovery 
problems in the field. The engineers of Google’s Chubby 
system, for example, reported data loss on four occasions 
due to database recovery errors [5]. In another paper, 
they reported another imperfect recovery that brought 
down the whole system [6]. After they tested Chubby 
with random multiple failures, they found more prob- 
lems. BigTable engineers also stated that cloud sys- 
tems see all kinds of failures (e.g., crashes, bad disks, 
network partitions, corruptions, etc.) [7]; other practi- 
tioners agree [6, 9]. They also emphasized that, as 
cloud services often depend on each other, a recovery 
problem in one service could permeate others, affect- 
ing overall availability and reliability [7]. To conclude, 
cloud systems face frequent, multiple and diverse fail- 
ures [4, 6, 7, 9, 17]. Yet, recovery implementations are 
rarely tested with complex failures and are not rigorously 
specified [4, 6]. 


2.2 Lens #2: Study of Bug/Issue Reports 


These anecdotes hint at the importance and complex- 
ity of failure handling, but offer few specifics on how 
to address the problem. Fortunately, many open-source 
cloud projects (e.g., ZooKeeper [19], Cassandra [23], 
HDFS [33]) publicly share in great detail real issues encountered
in the field. Therefore, we performed an in-depth
study of HDFS bug/issue reports [2]. There are
more than 1300 issues spanning 4 years of operation 
(April 2006 to July 2010). We scan all issues and study 
the ones that pertain to recovery problems due to hard- 
ware failures. In total, there are 91 recovery issues with 
severe implications such as data loss, unavailability, cor- 
ruption, and reduced performance (a more detailed de- 
scription can be found in our technical report [13]).

Based on this study, we made several observations.
First, most of the internal protocols already anticipate
failures. However, they do not cover all possible failures,
and thus exhibit problems in practice. Second,
the number of reported issues due to multiple failures is
still small. In this regard, excluding our 5 submissions,
the developers had reported only 3 issues, which mostly
arose in live deployments rather than systematic testing.
Finally, recovery issues appeared not only in the early 
years of the development but also recently, suggesting 
the lack of adoptable tools that can exercise failures au- 
tomatically. Reports from other cloud systems such as 
Cassandra and ZooKeeper also raise similar issues. 


2.3 Lens #3: Write Recovery Protocol 


Given so many recovery issues, one might wonder what
the inherent complexities are. To answer this, we dissect
the anatomy of HDFS write recovery. As background,
HDFS provides two write interfaces: write and
append. There is no overwrite. The write protocol looks
simple in essence, but when different failures come into
the picture, recovery complexity becomes evident. Figure 1
shows the write recovery protocol with three different
failure scenarios. Throughout the paper, we will use
HDFS terminology (blocks, datanodes/nodes, and namenode)
[33] instead of GoogleFS terminology (chunks,
chunk servers, and master) [10].

• Data-Transfer Recovery: Figure 1a shows a client
contacting the namenode to get a list of datanodes to
store three replicas of a block (s0). The client then initiates
the setup stage by creating a pipeline (s1) and continues
with the data transfer stage (s2). However, during
the transfer stage, the third node crashes (s2a). What
Figure 1a shows is the correct behavior of data-transfer
recovery. That is, the client recreates the pipeline by
excluding the dead node and continues transferring the
bytes from the last good offset (s2b); a background replication
monitor will regenerate the third replica.

• Data-Transfer Recovery Bug: Figure 1b shows a
bug in the data-transfer recovery protocol; one
specific code segment handles a failed data transfer
incorrectly (s2a). This bug makes the
client wrongly exclude the good node (Node2) and include
the dead node (Node3) in the next pipeline creation
(s2b). Since Node3 is dead, the client recreates
the pipeline only with the first node (s2c). If the first
node also crashes at this point (a multiple-failure scenario),
no valid blocks are stored. This implementation
bug reduces availability (i.e., due to unmasked failures).
We also found data-loss bugs in the append protocol due
to multiple failures (§6.2.1).

• Setup-Stage Recovery: Finally, Figure 1c shows
how setup-stage recovery differs from data-transfer
recovery. Here, the client first creates a pipeline
from two nodes in Rack1 and one in Rack2 (s0a). However,
due to the rack partitioning (s1), the client asks
the namenode again for a fresh pipeline (s0b); the
client has not transferred any bytes, and thus can start
streaming from the beginning. After asking the namenode
in several retries (not shown), the pipeline contains
only nodes in Rack1 (s0b). At the end, all replicas
reside in only one rack, which is correct because only one rack
is reachable during the write [33].

R1/2, and numeric letters represent the namenode, client, rack
number, and datanodes respectively. The client always starts
the activity to the namenode first before to the datanodes.


• Replication Monitor Bug: Although the previous case
is correct, it reveals a crucial design bug in the background
replication monitor. This monitor unfortunately
checks only the number of replicas but not their locations.
Thus, even after the partitioning is lifted, the replicas are
not migrated to multiple racks. This design bug greatly
reduces the block availability if Rack1 is completely unreachable
(more in §5.2.3).
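
The replication monitor bug can be illustrated with a minimal sketch. The function names, the rack labels, and the thresholds below are hypothetical and are not HDFS code; the sketch only contrasts a check that counts replicas with one that also looks at rack placement.

```python
# Hypothetical sketch, not HDFS code: contrasts a count-only replica
# check with a check that also considers rack placement.

def needs_work_count_only(replica_racks, target_replicas=3):
    """Mimics the buggy monitor: only the number of replicas matters."""
    return len(replica_racks) < target_replicas


def needs_work_rack_aware(replica_racks, target_replicas=3, min_racks=2):
    """Also flags blocks whose replicas all sit in too few racks."""
    too_few = len(replica_racks) < target_replicas
    too_concentrated = len(set(replica_racks)) < min_racks
    return too_few or too_concentrated


# State after the setup-stage recovery of Figure 1c: three replicas,
# all in Rack1, and the partition has since been lifted.
after_partition = ["Rack1", "Rack1", "Rack1"]

print(needs_work_count_only(after_partition))  # False: the bug goes unnoticed
print(needs_work_rack_aware(after_partition))  # True: migration is needed
```

Under these assumptions, the count-only check reports nothing to do even though a single rack failure would make the block unavailable.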


To sum up, we have illustrated the complexity of re- 
covery by showing how different failure scenarios lead 
to different recovery behaviors. There are more problems 
within this protocol and other protocols. Without an ap- 
propriate testing framework, it is hard to ensure recovery 
correctness; in one discussion of a newly proposed recovery
design, a developer commented: “I don’t
see any proof of correctness. How do we know this will
not lead to the same or other problems?” [2]


2.4 Current State of the Art: Does It Help? 


In the last three sections, we presented our motivation 
for powerful testing frameworks for cloud systems. A 
natural question to ask is whether existing frameworks 
can help. We answer this question in two parts: failure 
exploration and system specifications. 


2.4.1 Failure Exploration 


Developers are accustomed to easy-to-use unit-testing 
frameworks. For fault-injection purposes, unit tests are 
severely limited; a unit test often simulates a limited 
number of failure scenarios, and when it comes to injecting
multiple varieties of failures, one common practice
is to inject a sequence of random failures as part of the
unit test [6, 35].

To improve common practices, recent work has pro- 
posed more exhaustive fault-injection frameworks. For 
example, the authors of AFEX and LFI observe that the 
number of possible failure scenarios is “infinite” [20, 28]. 
Thus, AFEX and LFI automatically prioritize “high- 
impact targets” (e.g., unchecked system calls, tests likely 
to fail). So far, they target non-distributed systems and 
do not address multiple failures in detail. 

Recent system model-checkers have also proposed the 
addition of failures as part of the state exploration strate- 
gies [21, 37, 38, 39]. MODIST, for example, is capa- 
ble of exercising different combinations of failures (e.g., 
crashes, network failures) [38]. As we discuss later, 
exploring multiple failures creates a combinatorial ex- 
plosion problem. This problem has not been addressed 
by the MODIST authors, and thus they provide a ran- 
dom mode for exploring multiple failures. Overall, we 
found no work that attempts to systematically explore 
multiple-failure scenarios, something that cloud systems 
face more often than other distributed systems in the 
past [4, 9, 17, 18]. 


2.4.2 System Specifications 


Failure injection addresses only half of the challenge in 
recovery testing: exercising recovery code. In addition, 
proper tests require specifications of expected behavior 
from those code paths. In the absence of such speci- 
fications, the only behaviors that can be automatically 
detected are those that interrupt testing (e.g., system failures).
One easy way to specify expected behavior is to write extra checks as part of
a unit test. Developers often take this approach, but the
problem is that there are many specifications to write, and if
they are written in imperative languages (e.g., Java) the
code is bloated.

Some model checkers use existing consistency checks 
such as fsck [39], a powerful tool that contains hun- 
dreds of consistency checks. However, it has some draw- 
backs. First, fsck is only powerful if the system is mature 
enough; developers add more checks across years of de- 
velopment. Second, fsck is also often written in impera- 
tive languages, and thus its implementations are complex 
and unsurprisingly buggy [15]. Finally, fsck can express 
only “invariant-like” specifications (i.e., it only checks 
the state of the file system, but not the events that lead 
to the state). As we will see later, specifying recovery 
requires “behavioral” specifications.

Another advanced checking approach is WiDS [24,
25, 38]. As the target system runs, WiDS interposes and 
checks the system’s internal states. However, it employs 
a scripting language that still requires a check to be writ- 
ten in tens of lines of code [24, 25]. Furthermore, its 
interposition mechanism might introduce another issue: 
the checks are built by interposing specific implementa- 
tion functions, and if these functions evolve, the checks 
must be modified. The authors have acknowledged but 
not addressed this issue [24]. 

Frameworks for declarative specifications exist (e.g., 
Pip [32], P2 Monitor [34]). P2 Monitor only works if the 
target system is written in the same language [34]. Pip 
facilitates declarative checks, but a check is still written
in over 40 lines on average [32]. Also, these systems 
are not integrated with a failure service, and thus cannot 
thoroughly test recovery. 

Overall, most existing work uses approaches that could
result in large implementations of the specifications. Managing
hundreds of them becomes complicated, and they
must also evolve as the system evolves. In practice, developers
are reluctant to invest in writing detailed specifications
[2], and hence the number of written specifications
is typically small.


3 Goals 


To address the aforementioned challenges, we present 
a new testing framework for cloud systems: FATE and 
DESTINI. We first present our concrete goals here. 

• Target systems and users: We primarily target cloud
systems as they experience a wide variety of failures at 
a higher rate than any other systems in the past [14]. 
However, our framework is generic and applies to other 
distributed systems. Our targets so far are HDFS [33], 
ZooKeeper [19] and Cassandra [23]. We mainly use 
HDFS as our example in the paper. In terms of users, 
we target experienced system developers, with the goal 
of improving their ability to efficiently generate tests and 
specifications. 

• Seamless integration: Our approach requires source
code availability. However, for adoptability, our framework
should not modify the code base significantly. This
is accomplished by leveraging mature interposition technology
(e.g., AspectJ). Currently our framework can be
integrated into any distributed system written in Java.

• Rapid and systematic exploration of failures: Our
framework should help cloud system developers explore
multiple-failure scenarios automatically and more systematically
than random approaches. However, a complete
systematic exploration brings a new challenge: a
massive combinatorial explosion of failures, which takes
tens of hours to explore. Thus, our testing framework
must also be equipped with smart exploration strategies
(e.g., prioritizing non-similar failure scenarios first).

• Numerous detailed recovery specifications: Ideally,
developers should be able to write as many detailed specifications
as possible. The more specifications are written,
the finer the bug reports and the less time is needed for
debugging. To realize this, our framework must meet two
requirements. First, the specifications must be developer- 
friendly (i.e., concise, fast to write, yet easy to under- 
stand). Otherwise, developers will be reluctant to invest 
in writing specifications. Second, our framework must 
facilitate “behavioral” specifications. We note that ex- 
isting work often focuses on “invariant-like” specifica- 
tions. This is not adequate because recovery behaves dif- 
ferently under different failure scenarios, and while re- 
covery is still ongoing, the system is likely to go through 
transient states where some invariants are not satisfied. 


4 FATE: Failure Testing Service 


Within a distributed execution, there are many points 
in place and time where system components could fail. 
Thus, our goal is to exercise failures more methodically 
than random approaches. To achieve this, we present 
three contributions: a failure abstraction for expressing
failure scenarios (§4.1), a ready-to-use failure service
that can be integrated seamlessly into cloud systems
(§4.2), and novel failure prioritization strategies that
speed up testing by an order of magnitude (§4.3).


4.1 Failure IDs: Abstraction For Failures 


FATE’s ultimate goal is to exercise as many combinations 
of failures as possible. In a sense, this is similar to model 
checking which explores different sequences of states. 
One key technique employed in system model checkers 
is to record the hashes of the explored states. Similarly 
in our case, we introduce the concept of failure IDs, an 
abstraction for failure scenarios which can be hashed and 
recorded in history. A failure ID is composed of an I/O 
ID and the injected failure (Table 1). Below we describe 
these subcomponents in more detail. 

• I/O points: To construct a failure ID, we choose I/O
points (i.e., system/library calls that perform disk or network
I/Os) as failure points, mainly for three reasons.
First, hardware failures manifest as failed I/Os. Second,
from the perspective of a node in a distributed system,
I/O points are critical points that either change its
internal states or make a change to its outside world (e.g.,
disks, other nodes). Finally, I/O points are basic operations
in distributed systems, and hence an abstraction
built on these points can be used for broader purposes.

• Static and dynamic information: For each I/O point,
an I/O ID is generated from the static (e.g., system call,
source file) and dynamic information (e.g., stack trace,
node ID) available at the point. Dynamic information
is useful to increase failure coverage. For example, recovery
might behave differently if a failure happens in
different nodes (e.g., first vs. last node in the pipeline).

I/O ID Fields                     Values
Static            Func. call      OutputStream.flush()
                  Source file     BlockRecv.java (line 45)
Dynamic           Stack trace     (the stack trace)
                  Node ID         Node2
Domain-specific   Source          Node2
                  Dest.           Node1
                  Net. Mesg.      Setup Ack

Failure ID = hash ( I/O ID + Crash ) = 2849067135

Table 1: A Failure ID. A failure ID comprises an I/O ID
plus the injected failure (e.g., crash). Hash is used to record a
failure ID. For space, some fields are not shown.

• Domain-specific information: To increase failure
coverage further, an I/O ID carries domain-specific in- 
formation; a common I/O point could write to different 
file types or send messages to different nodes. FATE’s 
interposition mechanism provides runtime information 
available at an I/O point such as the target I/O (e.g., file 
names, IP addresses) and the I/O buffer (e.g., network 
packet, file buffer). To convert this raw information
into a more meaningful context (e.g., “Setup Ack” in Ta- 
ble 1), FATE provides an interface that developers can 
implement. For example, given an I/O buffer of a net- 
work message, a developer can implement the code that 
reverse-engineers the byte content of the message into a 
more meaningful message type (e.g., “Setup Ack”). If 
the interface is empty, FATE can still run (the interface 
returns an empty domain-specific string), but failure cov- 
erage could be sacrificed. 

• Possible failure modes: Given an I/O ID, FATE generates
a list of possible failures that could happen on the
I/O. For example, FATE could inject a disk failure on a
disk write, or a network failure before a node sends a
message. Currently, we support six failure types: crash,
permanent disk failure, disk corruption, node-level and
rack-level network partitioning, and transient failure. To
create failure IDs, the failure types appropriate to the
I/O are selected one at a time (and hence, given an I/O ID,
FATE could produce multiple failure IDs).
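
To make the construction concrete, the following is a minimal sketch of how an I/O ID and a failure type could be combined into a failure ID. The field layout follows Table 1, but the class name, the choice of SHA-1, and the mapping from I/O kinds to failure types are assumptions made for illustration; the source does not specify them.

```python
import hashlib
from dataclasses import dataclass, astuple

# Hypothetical field layout modeled on Table 1. The hash function and the
# I/O-kind-to-failure mapping below are assumptions, not FATE's own.

@dataclass(frozen=True)
class IOId:
    func_call: str      # static
    source_file: str    # static
    stack_trace: str    # dynamic
    node_id: str        # dynamic
    domain_info: str    # domain-specific; empty if no interface is provided


# Assumed applicability of the six failure types to two kinds of I/O.
FAILURES_BY_KIND = {
    "disk": ["crash", "permanent disk failure", "disk corruption",
             "transient failure"],
    "network": ["crash", "node-level partition", "rack-level partition",
                "transient failure"],
}


def failure_id(io_id: IOId, failure: str) -> int:
    """Hash an I/O ID plus one injected failure into a 32-bit integer."""
    key = "|".join(astuple(io_id) + (failure,))
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")


def failure_ids_for(io_id: IOId, kind: str) -> dict:
    """Produce one failure ID per failure type appropriate to this I/O."""
    return {f: failure_id(io_id, f) for f in FAILURES_BY_KIND[kind]}


io = IOId("OutputStream.flush()", "BlockRecv.java:45",
          "(the stack trace)", "Node2", "Node2->Node1 Setup Ack")
ids = failure_ids_for(io, "network")
print(len(ids))  # 4 failure IDs from a single I/O ID
```

A stable digest is used instead of Python's built-in hash(), which is randomized per process for strings and therefore could not be recorded in a history across runs.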


4.2 Architecture 


We built FATE with a goal of quick and seamless inte- 
gration into our target systems. Figure 2 depicts the four 
components of FATE: workload driver, failure surface, 
failure server, and filters. 


4.2.1 Workload Driver, Failure Surface, and Server 


We first instrument the target system (e.g., HDFS) by inserting
a “failure surface”. There are many possible layers
at which to insert a failure surface (e.g., inside a system library
or at the VMM layer). We do this between the target system
and the OS library (e.g., Java SDK), for two reasons.
First, at this layer, rich domain-specific information is
available. Second, by leveraging mature instrumentation
technology (e.g., AspectJ), adding the surface requires
no modification to the code base.

Figure 2: FATE Architecture. [Diagram: the workload driver reruns the workload (e.g., hdfs.write) while the server injects new failure IDs; the AspectJ failure surface in the instrumented HDFS obtains a fail / no-fail decision from the failure server.]

The failure surface has two important jobs. First, at 
each I/O point, it builds the I/O ID. Second, it needs to 
check if a persistent failure injected in the past affects this 
I/O point (e.g., network partitioning). If so, the surface 
returns an error to emulate the failure without the need 
to talk to the server. Otherwise, it sends the I/O ID to the 
server and receives a failure decision. 
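
The two jobs of the failure surface can be sketched as follows. The class names and the server interface are hypothetical, and for simplicity every injected failure is modeled as a persistent partition of the destination node, which is an assumption of this sketch rather than FATE's behavior.

```python
# Hypothetical sketch of the failure surface's per-I/O decision. The class
# names and the server interface are assumptions made for illustration.

class FailureServerStub:
    """Stands in for the failure server; fails a fixed set of I/O IDs."""

    def __init__(self, io_ids_to_fail):
        self.io_ids_to_fail = set(io_ids_to_fail)
        self.requests = 0

    def decide(self, io_id):
        self.requests += 1
        return io_id in self.io_ids_to_fail


class FailureSurface:
    def __init__(self, server):
        self.server = server
        self.partitioned = set()   # persistent failures injected earlier

    def on_io(self, io_id, dest_node):
        """Return True if this I/O should fail."""
        # A persistent failure injected in the past still affects this
        # I/O: emulate it locally without contacting the server.
        if dest_node in self.partitioned:
            return True
        # Otherwise send the I/O ID to the server and obey its decision.
        if self.server.decide(io_id):
            self.partitioned.add(dest_node)  # simplification: a partition
            return True
        return False


server = FailureServerStub({"send-setup-ack"})
surface = FailureSurface(server)
print(surface.on_io("send-setup-ack", "Node3"))  # True, decided by the server
print(surface.on_io("send-data", "Node3"))       # True, emulated locally
print(surface.on_io("send-data", "Node1"))       # False
print(server.requests)                           # 2
```

The second call never reaches the server, which is the point of checking persistent failures first.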

The workload driver is where the developer attaches 
the workload to be tested (e.g., write, append, or some se- 
quence of operations, including the pre- and post-setups) 
and specifies the maximum number of failures injected 
per run. As the workload runs, the failure server receives
I/O IDs from the failure surface, combines the I/O IDs
with possible failures into failure IDs, and makes failure
decisions based on the failure history. The workload
driver terminates when the server does not inject a new
failure scenario. The failure server, workload driver, and
target system run as separate processes, and they can
run on a single machine or on multiple machines.
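
The interaction between the workload driver and the failure server can be sketched in a single process, restricted to one failure per run. All names are hypothetical, the toy workload simply aborts at the first injected failure instead of running recovery, and the (I/O ID, failure) pair stands in for the hashed failure ID.

```python
# Hypothetical single-process sketch of the driver/server loop with one
# failure per run. FATE runs these components as separate processes.

class FailureServer:
    def __init__(self, failure_types):
        self.failure_types = failure_types
        self.history = set()     # failure IDs exercised so far
        self.injected = None     # failure ID injected in the current run

    def start_run(self):
        self.injected = None

    def decide(self, io_id):
        """Return the failure to inject at this I/O, or None."""
        if self.injected is not None:
            return None          # at most one failure per run
        for failure in self.failure_types:
            failure_id = (io_id, failure)   # stands in for the hash
            if failure_id not in self.history:
                self.history.add(failure_id)
                self.injected = failure_id
                return failure
        return None


def run_workload(server):
    """Toy workload with three I/O points; aborts on the first failure."""
    for io_id in ["setup", "transfer", "close"]:
        if server.decide(io_id) is not None:
            return


def drive(server):
    """Rerun the workload until the server injects no new failure."""
    runs = 0
    while True:
        server.start_run()
        run_workload(server)
        runs += 1
        if server.injected is None:
            return runs


server = FailureServer(["crash", "transient failure"])
print(drive(server), len(server.history))  # 7 6
```

Three I/O points and two failure types give six failure IDs, so the driver performs six runs with an injected failure plus one final run in which nothing new is injected.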


4.2.2 Brute-Force Failure Exploration 


By default, FATE runs in brute-force mode. That is, FATE
systematically explores all possible combinations of observed
failure IDs (the algorithm can be found in our
technical report [13]). With this brute-force mode, FATE
has exercised over 40,000 unique combinations of one,
two and three failure IDs. We address this combinatorial
explosion challenge in the next section (§4.3).


4.2.3 Filters 


FATE uses information carried in I/O and failure IDs to 
implement filters at the server side. A filter can be used to 
regenerate a particular failure scenario. For example, to 
regenerate the failure described in Table 1, a developer 
could specify a filter that will only exercise the corre- 
sponding failure ID. A filter could also be used to reduce 
the failure space. For example, a developer could insert 
a filter that allows crash-only failures, failures only on 
some specific I/Os, or any failures only at datanodes. 
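
One way to picture such filters is as predicates over the fields carried in a failure ID. The sketch below is an assumption about their shape, not FATE's interface; the dictionary keys and every hash value other than the one in Table 1 are made up for illustration.

```python
# Hypothetical sketch: filters as predicates over failure-ID fields.
# Keys and sample values are assumptions modeled on Table 1.

def crash_only(fid):
    return fid["failure"] == "crash"


def datanodes_only(fid):
    return fid["role"] == "datanode"


def only_failure_id(target_hash):
    """Build a filter that regenerates one specific failure scenario."""
    return lambda fid: fid["hash"] == target_hash


def allowed(fid, filters):
    """A failure ID is exercised only if every filter accepts it."""
    return all(f(fid) for f in filters)


candidates = [
    {"hash": 2849067135, "failure": "crash", "role": "datanode"},
    {"hash": 1111, "failure": "disk corruption", "role": "datanode"},
    {"hash": 2222, "failure": "crash", "role": "namenode"},
]

# Reduce the failure space: crash-only failures at datanodes.
reduced = [c for c in candidates if allowed(c, [crash_only, datanodes_only])]
print([c["hash"] for c in reduced])   # [2849067135]

# Regenerate the scenario of Table 1.
replay = [c for c in candidates if allowed(c, [only_failure_id(2849067135)])]
print(len(replay))                    # 1
```

Composing filters by conjunction keeps each one small, so a developer can mix a failure-type filter with a node filter without writing a new one.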

Figure 3: Prioritization of Pairwise Dependent and Independent Failures.


4.3 Failure Exploration Strategy


Running FATE in brute-force mode is impractical and
time consuming. As an example, we have run the append
protocol with a filter that allows crash-only failures on
disk I/Os in datanodes. With this filter, injecting two failures
per run gives 45 failure IDs to exercise, which leads
us to 1199 combinations that take more than 2 hours to
run. Without the filter (i.e., including network I/Os and
other types of failures) the number will further increase.
This introduces the problem of exponential explosion of
multiple failures, which has to be addressed given the
fact that we are dealing with a large code base where an
experiment could take more than 5 seconds per run (e.g.,
due to pre- and post-setup overheads).

Among the 1199 experiments, 116 failed; if recovery 
is perfect, all experiments should be successful. Debug- 
ging all of them led us to 3 bugs as the root causes. Now, 
we can concretely define the challenge: Can FATE exercise
a much smaller number of combinations and find
distinct bugs faster? This section provides some solutions
to this challenge. To the best of our knowledge, we
are the first to address this issue in the context of distributed
systems. Thus, we also hope that this challenge
attracts systems researchers to present other alternatives.

To address this challenge, we have studied the properties
of multiple failures (for simplicity, we begin with
two-failure scenarios). A pair of failures can be categorized
into two types: pairwise dependent and pairwise
independent failures. Below, we describe each category
along with the prioritization strategies. Due to space constraints,
we could not show the detailed pseudo-code, and
thus we only present the algorithms at a high level. We
will evaluate the algorithms in Section 6.3. We also emphasize
that our proposed strategies are built on top of
the information carried in failure IDs, and hence display
the power of the failure ID abstraction.


4.3.1 Pairwise Dependent Failures 


A pair of failure IDs is dependent if the second ID is
observed only if the failure on the first ID is injected;
observing the occurrence of a failure ID does not necessarily
mean that the failure must be injected. The key
here is to use observed I/Os to capture path coverage
information (this is an acceptable assumption since we
are dealing with distributed systems where recovery essentially
manifests as I/Os). Figure 3a illustrates some
combinations of dependent failure IDs. For example, F
is dependent on C or D (i.e., F will never be observed unless
C or D is injected). The brute-force algorithm will
inefficiently exercise all six possible combinations: AE,
BE, CE, DE, CF, and DF.

To prioritize dependent failure IDs, we introduce a
strategy that we call recovery-behavior clustering. The 
goal is to prioritize “non-similar” failure scenarios first. 
The intuition is that non-similar failure scenarios typi- 
cally lead to different recovery behaviors, and recovery 
behaviors can be represented as a sequence of failure 
IDs. Thus, to perform the clustering, we first run a com- 
plete set of experiments with only one failure per run, 
and in each run we record the subsequent failure IDs. 

We formally define subsequent failure IDs as all ob- 
served IDs after the injected failure up to the point where 
the system enters the stable state. That is, recording re- 
covery only up to the end of the protocol (e.g., write) 
is not enough. This is because a failed I/O could leave 
some “garbage” that is only cleaned up by some back- 
ground protocols. For example, a failed I/O could leave 
a block with an old generation timestamp that should be 
cleaned up by the background replication monitor (out- 
side the scope of the write protocol). Moreover, different 
failures could leave different types of garbage, and thus 
lead to different recovery behaviors of the background 
protocols. By capturing subsequent failure IDs until the 
stable state, we ensure more fine-grained clustering. 

The exact definition of stable state might be different 
across different systems. For HDFS, our definition of 
stable state is: FATE reboots dead nodes if any, removes 
transient failures (e.g., network partitioning), sends com- 
mands to the datanodes to report their blocks to the na- 
menode, and waits until all datanodes receive a null com- 
mand (i.e., no background jobs to run). 

Going back to Figure 3a, the created mappings be- 
tween the first failures and their subsequent failure IDs 
are: {A→E}, {B→E}, {C→E, F}, and {D→E, F}. The
recovery behaviors then are clustered into two: {E}, and
{E, F}. Finally, for each recovery cluster, we pick only
one failure ID on which the cluster is dependent. The final
prioritized combinations are marked with bold edges
in Figure 3a. That is, FATE only exercises: AE, CE, and 
CF. Note that E is exercised as a second failure twice be- 
cause it appears in different recovery clusters. 
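
The clustering step for dependent failures can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not FATE's actual implementation: the function name prioritize_dependent is hypothetical, each recovery behavior is modeled as an unordered set of subsequent failure IDs, and the representative of a cluster is simply the alphabetically first failure ID.

```python
from typing import Dict, FrozenSet, List, Tuple


def prioritize_dependent(subsequent: Dict[str, FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Keep one first failure per recovery cluster and pair it with every
    subsequent failure ID observed in that cluster (hypothetical helper)."""
    representative: Dict[FrozenSet[str], str] = {}
    for first in sorted(subsequent):  # deterministic choice of representative
        representative.setdefault(subsequent[first], first)
    pairs = []
    for cluster, first in representative.items():
        for second in sorted(cluster):
            pairs.append((first, second))
    return sorted(pairs)


# Mappings of Figure 3a, as recorded from the single-failure runs.
observed = {
    "A": frozenset({"E"}),
    "B": frozenset({"E"}),
    "C": frozenset({"E", "F"}),
    "D": frozenset({"E", "F"}),
}
pairs = prioritize_dependent(observed)
print(pairs)  # [('A', 'E'), ('C', 'E'), ('C', 'F')]
```

With the Figure 3a mappings, the sketch keeps AE, CE, and CF, the same three combinations listed above, instead of the six brute-force ones.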


4.3.2 Pairwise Independent Failures 


A pair of failure IDs is independent if the second ID is
observed even if the first ID is not injected. This case 
is often observed when the same piece of code runs in 
parallel, which is a common characteristic found in dis- 
tributed systems (e.g., two phase commit, leader election, 
HDFS write and append). Figure 3b illustrates a scenario
where the same I/O points A and B are executed concurrently
in three nodes (i.e., A1, A2, A3, B1, B2, B3). Let’s
name these two I/O points A and B static failure points,
or SFP for short (as they exclude the node ID). With brute-force
exploration, FATE produces 24 combinations (the
12 bi-directional edges in Figure 3b). More generally,
there are SFP² × N(N − 1) combinations, where N and
SFP are the number of nodes and static failure points, respectively.
To reduce this quadratic growth, we introduce
two levels of prioritization: one for reducing N(N − 1)
and the other for SFP².
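
The brute-force count can be checked by direct enumeration. The following is a minimal sketch, assuming that a combination is an ordered pair of failure IDs placed on two distinct nodes; the helper name brute_force_pairs is hypothetical.

```python
from itertools import product


def brute_force_pairs(static_points, nodes):
    """All ordered pairs of failure IDs that sit on two distinct nodes."""
    ids = [(point, node) for point in static_points for node in nodes]
    return [(x, y) for x, y in product(ids, ids) if x[1] != y[1]]


# Figure 3b: static failure points A and B on three nodes.
combos = brute_force_pairs(["A", "B"], [1, 2, 3])
print(len(combos))  # 24 == SFP^2 * N * (N - 1) with SFP = 2 and N = 3
```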

To reduce N(N − 1), we leverage the property of symmetric
code (i.e., the same code that runs concurrently
in different nodes). Because of this property, if a pair 
of failures has been exercised at two static failure points 
of two specific nodes, it is not necessary to exercise the 
same pair for other pairs of nodes. For example, if A1B2 
has been exercised, it is not necessary to run A1B3, A2B1, 
A2B3, and so on. As a result, we have reduced N(N − 1)
(i.e., any combination of two nodes) to just one (i.e., a
single pair of nodes); N does not matter anymore.

Although the first level of reduction is significant, 
FATE still hits the SFP² bottleneck, as illustrated in Figure 3c.
Here, instead of having two static failure points,
there are four, which leads to 16 combinations. To reduce
SFP², we utilize the behavior clustering algorithm
used in the dependent case. That is, if injecting failure
ID A1 results in the same recovery behavior as injecting
B1, then we cluster them together (i.e., only one of
them needs to be exercised). Put simply, the goal is to
reduce SFP to SFP_clustered, which will reduce the input
to the quadratic explosion (e.g., from 4 to 2, resulting
in 4 uni-directional edges as depicted in Figure 3d). In
practice, we have seen a reduction from fifteen SFP to
eight SFP_clustered.
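
The two levels of reduction can be sketched together as follows. This is an illustrative sketch only: the function name prioritize_independent and the recovery behaviors assigned to A through D are hypothetical (they are not taken from Figure 3c), and fixing the node pair to nodes 1 and 2 is an arbitrary choice permitted by the symmetry argument.

```python
def prioritize_independent(behavior):
    """behavior maps each static failure point to its recorded recovery behavior."""
    # Level 2: keep one static failure point per distinct recovery behavior.
    reps = {}
    for point in sorted(behavior):
        reps.setdefault(behavior[point], point)
    clustered = sorted(reps.values())
    # Level 1: the code is symmetric, so a single pair of nodes (1 and 2) suffices.
    return [(f"{x}1", f"{y}2") for x in clustered for y in clustered]


# Four static failure points; A/B and C/D are assumed to recover identically.
behavior = {"A": ("E",), "B": ("E",), "C": ("E", "F"), "D": ("E", "F")}
plan = prioritize_independent(behavior)
print(plan)  # [('A1', 'A2'), ('A1', 'C2'), ('C1', 'A2'), ('C1', 'C2')]
```

Four static failure points collapse to two clustered ones, so 16 combinations shrink to 4 uni-directional edges.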


4.4 Summary 


We have introduced FATE, a failure testing service capa- 
ble of exploring multiple, diverse failures in systematic 
fashion. FATE employs failure IDs as a new abstraction
for exploring failures. FATE is also equipped with prioritization
strategies that prioritize failure scenarios that
result in distinct recovery actions. Our approaches are 
not sound; however by experience, all bugs found with 
brute-force are also found with prioritization (more in 
§6.3). If developers have the time and resources, they
could fall back to brute-force mode for more confidence. 
So far, we have only explained our algorithms for two- 
failure scenarios. We have generalized them to three- 
failure, but cannot present them due to space constraints. 
One fundamental limitation of FATE is the absence of 
I/O reordering [38], and thus it is possible that some or- 
derings of failures are not exercised. Adopting related 
techniques from existing work [38] will be beneficial
in our case. 


5 DESTINI: Declarative Testing 
Specifications 


After failures are injected, developers still need to check 
for system correctness. As described in the motivation 
(§2.4), DESTINI attempts to improve the state-of-the-art
of writing system specifications. In the following
sections, we first describe the architecture (§5.1), then
present some examples (§5.2), and finally summarize the
advantages (§5.3). Currently, we target recovery bugs
that reduce availability (e.g., unmasked failures, fail- 
stop) and reliability (e.g., data-loss, inconsistency). We 
leave performance and scalability bugs for future work. 


5.1 Architecture 


At the heart of DESTINI is Datalog, a declarative rela- 
tional logic language. We chose the Datalog style as it 
has been successfully used for building distributed sys- 
tems [3, 26] and for verifying some aspects of system 
correctness (e.g., security [12, 31]). Unlike much of that 
work, we are not using Datalog to implement system in- 
ternals, but only to write correctness specifications that 
are checked relatively rarely. Hence we are less depen- 
dent on the efficiency of current Datalog engines, which 
are still evolving [3]. 

In terms of the architecture, DESTINI is designed such 
that developers can build specifications from minimal in- 
formation. To support this, DESTINI comprises three fea- 
tures as depicted in Figure 4. First, it interposes network 
and disk protocols and translates the available information
into Datalog events (e.g., cnpEv). Second, it records
failure scenarios by having FATE inform DESTINI about 
failure events (e.g., fateEv). This highlights that FATE 
and DESTINI must work hand in hand, a valuable prop- 
erty that is apparent throughout our examples. Finally, 
based only on events, it records facts, deduces expectations
of how the system should behave in the future, and
compares the two.


Figure 4: DESTINI Architecture. (The figure includes the
sample rule: stateY(...) :- cnpEv(...), stateX(...);)


5.1.1 Rule Syntax 


In DESTINI, specifications are formally written as Data- 
log rules. A rule is essentially a logical relation: 


errX(P1,P2,P3) :- cnpEv(P1), NOT-IN stateY(P1,P2,_),
                  P2 == img, P3 := Util.strLib(P2);


This Datalog rule consists of a head table (errX) 
and predicate tables in the body (cnpEv and stateY). 
The head is evaluated when the body is true. Tu- 
ple variables begin with an upper-case letter (P1). A 
don’t care variable is represented with an underscore 
(_). A comma between predicates represents conjunction.
“:=” is for assignments. We also provide some
helper libraries (Util.strLib() to manipulate strings). 
Lower case variables (img) represent integer or string 
constants. All upper case letters (NOT-IN) are Datalog 
keywords. Events are in italic. To help readers track 
where events originate from, an event name begins with 
one of these labels: cnp, dnp, cdp, ddp, fs, which 
stand for client-namenode, datanode-namenode, client- 
datanode, datanode-datanode, and file system protocols 
respectively (Figure 4). Non-event (non-italic) heads and 
predicates are essentially database tables with primary 
keys defined in some schemas (not shown). A table that 
starts with err represents an error (i.e., if a specification 
is broken, the error table is non-empty, implying the ex- 
istence of one or more bugs). 
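
To make the rule syntax concrete, the sample rule errX can be read as a small loop over tables. The sketch below is one hedged interpretation in Python, not DESTINI's evaluation engine: tables are modeled as sets of tuples, and str_lib is a stand-in for Util.strLib(), whose behavior the text does not define.

```python
def str_lib(value):
    # Stand-in for Util.strLib(); what it does is an assumption for this sketch.
    return value.upper()


def eval_err_x(cnp_ev, state_y, img="img"):
    """errX(P1,P2,P3) :- cnpEv(P1), NOT-IN stateY(P1,P2,_),
                         P2 == img, P3 := Util.strLib(P2);"""
    err_x = set()
    for (p1,) in cnp_ev:                          # cnpEv(P1)
        p2 = img                                  # P2 == img (a constant)
        present = any(row[0] == p1 and row[1] == p2 for row in state_y)
        if not present:                           # NOT-IN stateY(P1,P2,_)
            err_x.add((p1, p2, str_lib(p2)))      # P3 := Util.strLib(P2)
    return err_x


cnp_ev = {("f1",), ("f2",)}
state_y = {("f1", "img", 7)}
errors = eval_err_x(cnp_ev, state_y)
print(errors)  # {('f2', 'img', 'IMG')}
```

A non-empty result corresponds to a non-empty error table, i.e., a broken specification.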


5.2 DESTINI Examples 


This section presents the powerful features of DESTINI 
via four examples of HDFS recovery specifications. In 
the first example, we present five important components 
of recovery specifications (§5.2.1). To help simplify the
complex debugging process, the second example shows
how developers can incrementally add tighter specifications
(§5.2.2). The third example presents specifications
that incorporate a different type of failure than the first
two examples (§5.2.3). Finally, we illustrate how developers
can refine existing specifications (§5.2.4).


5.2.1 Specifying Data-Transfer Recovery


DESTINI facilitates five important elements of recovery 
specifications: checks, expectations, facts, precise fail- 
ure events, and check timings. Here, we present these 
elements by specifying the data-transfer recovery proto- 
col (Figure 1a); this recovery is correct if valid replicas 
are stored in the surviving nodes of the pipeline. 

• Checks: To catch violations of data-transfer recovery,
we start with a simple high-level check (a1), which
says “upon block completion, throw an error if there is
a node that is expected to store a valid replica, but actually
does not.” This rule shows how a check is composed
of three elements: the expectation (expectedNodes), fact
(actualNodes), and check timing (cnpComplete).


• Expectations: The expectation (expectedNodes) is deduced
from protocol events (a2-a8). First, without any
failure, the expectation is to have the replicas in all the
nodes in the pipeline (a3); information about pipeline
nodes is accessible from the setup reply from the namenode
to the client (a2). However, if there is a crash,
the expectation changes: the crashed node should be removed
from the expected nodes (a4). This implies that
an expectation is also based on failure events.


• Failure events: Failures in different stages result in
different recovery behaviors. Thus, we must know pre- 
cisely when failures occur. For data-transfer recovery, 
we need to capture the current stage of the write pro- 
cess and only change the expectation if a crash occurs 
within the data-transfer stage (fateCrashNode happens 
at Stg==2 in rule a4). The data transfer stage is deduced 
in rules a5-a8: the second stage begins after all acks from 
the setup phase have been received. 


Before moving on, we emphasize two important ob- 
servations here. First, this example shows how FATE 
and DESTINI must work hand in hand. That is, recovery 
specifications require a failure service to exercise them, 
and a failure service requires specifications of expected 
failure handling. Second, with logic programming, de- 
velopers can easily build expectations only from events. 


• Facts: The fact (actualNodes) is also built from events
(a9-a16); specifically, by tracking the locations of valid
replicas. A valid replica can be tracked with two pieces
of information: the block’s latest generation timestamp,
which DESTINI tracks by interposing two interfaces (a9
and a10), and meta/checksum files with the latest generation
timestamp, which are obtainable from file operations
(a11-a15). With this information, we can build the runtime
fact: the nodes that store the valid replicas of the
block (a16).

• Check timings: The final step is to compare the expectation
and the fact. We underline that the timing of
the check is important because we are specifying recovery
behaviors, unlike invariants, which must be true at
all times. Not paying attention to this will result in false
warnings (i.e., there is a period of time when recovery is
ongoing and specifications are not met). Thus, we need
precise events to signal check times. In this example, the
check time is at block completion (cnpComplete in a1).


Section 5.2.1: Data-Transfer Recovery Specifications
a1   errDataRec (B, N)              :- cnpComplete (B), expectedNodes (B, N), NOT-IN actualNodes (B, N);
a2   pipeNodes (B, Pos, N)          :- cnpGetBlkPipe (UFile, B, Gs, Pos, N);
a3   expectedNodes (B, N)           :- pipeNodes (B, Pos, N);
a4   DEL expectedNodes (B, N)       :- fateCrashNode (N), pipeStage (B, Stg), Stg == 2,
                                       expectedNodes (B, N);
a5   setupAcks (B, Pos, Ack)        :- cdpSetupAck (B, Pos, Ack);
a6   goodAcksCnt (B, COUNT<Ack>)    :- setupAcks (B, Pos, Ack), Ack == ’OK’;
a7   nodesCnt (B, COUNT<Node>)      :- pipeNodes (B, _, N, _);
a8   pipeStage (B, Stg)             :- nodesCnt (NCnt), goodAcksCnt (ACnt), NCnt == ACnt, Stg := 2;
a9   blkGenStamp (B, Gs)            :- dnpNextGenStamp (B, Gs);
a10  blkGenStamp (B, Gs)            :- cnpGetBlkPipe (UFile, B, Gs, _, _);
a11  diskFiles (N, File)            :- fsCreate (N, File);
a12  diskFiles (N, Dst)             :- fsRename (N, Src, Dst), diskFiles (N, Src);
a13  DEL diskFiles (N, Src)         :- fsRename (N, Src, Dst), diskFiles (N, Src);
a14  fileTypes (N, File, Type)      :- diskFiles (N, File), Type := Util.getType(File);
a15  blkMetas (N, B, Gs)            :- fileTypes (N, File, Type), Type == metafile,
                                       B := Util.getBlk(File), Gs := Util.getGs(File);
a16  actualNodes (B, N)             :- blkMetas (N, B, Gs), blkGenStamp (B, Gs);

Section 5.2.2: Tighter Specifications for Data-Transfer Recovery
b1   errBadAck (Pos, N)             :- cdpDataAck (Pos, ’Error’), pipeNodes (B, Pos, N), liveNodes (N);
b2   liveNodes (N)                  :- dnpRegistration (N);
b3   DEL liveNodes (N)              :- fateCrashNode (N);
b4   errBadConnect (N, TgtN)        :- ddpDataTransfer (N, TgtN, Status), liveNodes (TgtN),
                                       Status == terminated;

Section 5.2.3: Rack-Aware Policy Specifications
c1   warnSingleRack (B)             :- rackCnt (B, 1), actualRacks (B, R), connectedRacks (R, OtherR);
c2   actualRacks (B, R)             :- actualNodes (B, N), nodeRackMap (N, R);
c3   rackCnt (B, COUNT<R>)          :- actualRacks (B, R);
c4   DEL connectedRacks (R1, R2)    :- fatePartitionRacks (R1, R2);
c5   err1RackOnCompletion (B)       :- cnpComplete (B), warnSingleRack (B);
c6   err1RackOnStableState (B)      :- fateStableState (_), warnSingleRack (B);

Section 5.2.4: Refining Log-Recovery Specifications
d1   errLostUFile (UFile)           :- expectedUFile (UFile), NOT-IN ufileInNameNode (UFile);
d2   ufileInNameNode (UFile) **     :- ufileInNnFile (F, NnFile), (NnFile == img || NnFile == log ||
                                       NnFile == img2);
d3   ufileInNameNode (UFile)        :- ufileInNnFile (F, img2), logRecStage (Stg), Stg == 4;
d4   ufileInNameNode (UFile)        :- ufileInNnFile (F, img), logRecStage (Stg), Stg != 4;
d5   ufileInNameNode (UFile)        :- ufileInNnFile (F, log), logRecStage (Stg), Stg != 4;

Table 2: Sample Specifications. The table lists all the
rules we wrote to specify the problems in Section 5.2; rules aX, bX,
cX, and dX are for Sections 5.2.1, 5.2.2, 5.2.3, and 5.2.4, respectively.
All logical relations are built only from events (in italic). The
shaded rows indicate checks that catch violations. A check always
starts with err. Tuple variables B, Gs, N, Pos, R, Stg, NnFile,
and UFile are abbreviations for block, generation timestamp, node,
position, rack, stage, namenode file, and user file, respectively;
others should be self-explanatory. Each table has primary keys
defined in a schema (not shown). (**) Rule d2 is refined in d3 to
d5; these rules are described more in our short paper [14].
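
The interplay of expectation, fact, failure event, and check timing can be sketched imperatively. The following Python sketch is only an approximation of rules a1 to a4: the function name check_data_recovery is hypothetical, and the events dataTransferBegins and validReplica are assumed stand-ins for the stage deduction (a5-a8) and the replica tracking (a9-a16), which are not modeled here.

```python
def check_data_recovery(events):
    """Replay events and return errDataRec tuples raised at block completion."""
    expected, actual, stage, errors = {}, {}, {}, []
    for ev in events:
        kind, args = ev[0], ev[1:]
        if kind == "cnpGetBlkPipe":            # a2, a3: pipeline nodes are expected
            blk, node = args
            expected.setdefault(blk, set()).add(node)
        elif kind == "dataTransferBegins":     # assumed stand-in for a5-a8
            stage[args[0]] = 2
        elif kind == "fateCrashNode":          # a4: crash in stage 2 lowers expectation
            for blk, nodes in expected.items():
                if stage.get(blk) == 2:
                    nodes.discard(args[0])
        elif kind == "validReplica":           # assumed stand-in for a9-a16
            blk, node = args
            actual.setdefault(blk, set()).add(node)
        elif kind == "cnpComplete":            # a1: the check timing
            blk = args[0]
            missing = expected.get(blk, set()) - actual.get(blk, set())
            errors += [("errDataRec", blk, n) for n in sorted(missing)]
    return errors


timeline = [
    ("cnpGetBlkPipe", "blk_x", "N1"),
    ("cnpGetBlkPipe", "blk_x", "N2"),
    ("cnpGetBlkPipe", "blk_x", "N3"),
    ("dataTransferBegins", "blk_x"),
    ("fateCrashNode", "N3"),
    ("validReplica", "blk_x", "N1"),
    ("cnpComplete", "blk_x"),
]
found = check_data_recovery(timeline)
print(found)  # [('errDataRec', 'blk_x', 'N2')]
```

The crashed node N3 is removed from the expectation, so only the surviving node N2, which is expected to hold a valid replica but does not, is reported.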


5.2.2 Debugging with Tighter Specifications 


The rules in the previous section capture the high-level 
objective of HDFS data-transfer recovery. After we ran 
FATE to cover the first crash scenario in Figure 1b (for 
simplicity of explanation, we exclude the second crash), 
rule a1 throws an error due to a bug that wrongly excludes
the good second node (Figure 1b in §2.3). Although
the check unearths the bug, it does not pinpoint
the bug (i.e., answer why the violation is thrown). 

To improve this debugging process, we added more 
detailed specifications. In particular, from the events that 
DESTINI logs, we observed that the client excludes the 
second node in the next pipeline, which is possible if the 
client receives a bad ack. Thus, we wrote another check 
(b1) which says “throw an error if the client receives a 
bad ack for a live node” (b1’s predicates are specified 
in b2 and b3). Note that this check is written from the 
client’s view, while rule a1 is written from the global view.

The new check catches the bug closer to the source, 
but also raises a new question: Why does the client re- 
ceive a bad ack for the second node? One logical ex- 
planation is because the first node cannot communicate 
with the second node. Thus, we easily added many checks
that catch unexpected bad connections such as b4, which 
finally pinpoints the bug: the second node, upon seeing 
a failed connection to the crashed third node, incorrectly 
closes the streams connected to the first node; note that 
this check is written from the datanode’s view. 

In summary, more detailed specifications prove to be 
valuable for assisting developers with the complex de- 
bugging process. This is unlikely to happen if a check 
implementation is long. But with DESTINI, a check can 
be expressed naturally in a small number of logical re- 
lations. Moreover, checks can be written from different 
views (e.g., global, client and datanode as shown in a1,
b1, b4 respectively). Table 3 shows a timeline of when 
these various checks are violated. As shown, tighter 
specifications essentially fill the “explanation gaps” be- 
tween the injected failure and the wrong final state of the 
system. 
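
As a rough illustration of the client-view check, rules b1 to b3 can be approximated as follows. This is a simplified sketch, not DESTINI code: the function name check_bad_acks is hypothetical, and the pipeline is keyed by position only (the block ID carried by pipeNodes is dropped for brevity).

```python
def check_bad_acks(events):
    """Flag error acks that the client receives for nodes that are still alive."""
    live, pipe, errors = set(), {}, []
    for kind, *args in events:
        if kind == "dnpRegistration":      # b2: node joins the live set
            live.add(args[0])
        elif kind == "fateCrashNode":      # b3: crashed node leaves the live set
            live.discard(args[0])
        elif kind == "cnpGetBlkPipe":      # a2: pipeline position -> node
            pos, node = args
            pipe[pos] = node
        elif kind == "cdpDataAck":         # b1: bad ack for a live node
            pos, status = args
            node = pipe.get(pos)
            if status == "Error" and node in live:
                errors.append(("errBadAck", pos, node))
    return errors


trace = [
    ("dnpRegistration", "N1"), ("dnpRegistration", "N2"), ("dnpRegistration", "N3"),
    ("cnpGetBlkPipe", 1, "N1"), ("cnpGetBlkPipe", 2, "N2"), ("cnpGetBlkPipe", 3, "N3"),
    ("fateCrashNode", "N3"),
    ("cdpDataAck", 2, "Error"),
]
bad_acks = check_bad_acks(trace)
print(bad_acks)  # [('errBadAck', 2, 'N2')]
```

An error ack for position 3 would not be flagged, because N3 is no longer live; only the unexpected bad ack for the live node N2 is reported.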


5.2.3 Specifying Rack-Aware Replication Policy 


In this example, we write specifications for the HDFS 
rack-aware replication policy, an important policy for 
high availability [10, 33]. Unlike previous examples, this 
example incorporates network partitioning failure mode. 

Time, Events, and Errors
t1: Client asks the namenode for a block ID and the nodes.
    cnpGetBlkPipe (usrFile, blk_x, gs1, 1, N1);
    cnpGetBlkPipe (usrFile, blk_x, gs1, 2, N2);
    cnpGetBlkPipe (usrFile, blk_x, gs1, 3, N3);
t2: Setup stage begins (pipeline nodes setup the files). *
    fsCreate (N1, tmp/blk_x_gs1.meta);
    fsCreate (N2, tmp/blk_x_gs1.meta);
    fsCreate (N3, tmp/blk_x_gs1.meta);
t3: Client receives setup acks. Data transfer begins.
    cdpSetupAck (blk_x, 1, OK);
    cdpSetupAck (blk_x, 2, OK);
    cdpSetupAck (blk_x, 3, OK);
t4: FATE crashes N3. Got error (b4).
    fateCrashNode (N3);
    errBadConnect (N1, N2); // should be good
t5: Client receives an erroneous ack. Got error (b1).
    cdpDataAck (2, Error);
    errBadAck (2, N2); // should be good
t6: Recovery begins. Get new generation time stamp.
    dnpNextGenStamp (blk_x, gs2);
t7: Only N1 continues and finalizes the files.
    fsCreate (N1, tmp/blk_x_gs2.meta);
    fsRename (N1, tmp/blk_x_gs2.meta,
              current/blk_x_gs2.meta);
t8: Client marks completion. Got error (a1).
    cnpComplete (blk_x);
    errDataRec (blk_x, N2); // should exist

Table 3: A Timeline of DESTINI Execution. The
table shows the timeline of runtime events (italic) and errors
(shaded). Tighter specifications capture the bug earlier in
time. The tuples (strings/integers) are real entries (not variable
names). For space, we do not show block-file creations (but
only meta files*) nor how the rules in Table 2 are populated.


According to the HDFS architects [33], the write protocol
should ensure that block replicas are spread across
a minimum of two available racks. But, if only one rack
is reachable, it is acceptable to use one rack temporarily.
To express this, rule c1 throws a warning if a block’s
rack could reach another rack, but the block’s rack count
is one (rules c2-c4 provide topology information, which
is initialized when the cluster starts and updated when
FATE creates a rack partition). This warning becomes a
hard error only if it is true upon block completion (c5) or
stable state (c6). Note again how these timings are important
to prevent false errors; while recovery is ongoing,
replicas are still being re-shuffled into multiple racks.
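
The difference between the warning (c1) and the two hard errors (c5, c6) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, partitionHealed is an assumed event that restores a rack link (the text only says that FATE removes transient failures before the stable state), and the block's replicas are assumed to stay on a single rack throughout.

```python
def warn_single_rack(block_racks, connected):
    """c1: the block sits on one rack although that rack can reach another."""
    if len(block_racks) != 1:
        return False
    (rack,) = block_racks
    return any(rack in pair for pair in connected)


def check_rack_policy(events, block_racks, connected):
    errors = []
    connected = set(connected)
    for kind, *args in events:
        if kind == "fatePartitionRacks":       # c4: DEL connectedRacks
            connected.discard(frozenset(args))
        elif kind == "partitionHealed":        # assumed event: link is restored
            connected.add(frozenset(args))
        elif kind == "cnpComplete" and warn_single_rack(block_racks, connected):
            errors.append("err1RackOnCompletion")      # c5
        elif kind == "fateStableState" and warn_single_rack(block_racks, connected):
            errors.append("err1RackOnStableState")     # c6
    return errors


links = {frozenset({"R1", "R2"})}
run = [
    ("fatePartitionRacks", "R1", "R2"),
    ("cnpComplete",),
    ("partitionHealed", "R1", "R2"),
    ("fateStableState",),
]
rack_errors = check_rack_policy(run, {"R1"}, links)
print(rack_errors)  # ['err1RackOnStableState']
```

While the racks are partitioned, a single-rack block raises no error at completion; once the partition is gone, a block that still sits on one rack is flagged at the stable state.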


With these checks, DESTINI found the bug in Figure 1c
(§2.3), a critical bug that could greatly reduce
availability: all replicas of a block are stored in a single
rack. Note that the bug does not violate the completion
rule (because the racks are still partitioned). But, it
does violate the stable state rule because even after the
network partitioning is removed, the replication monitor
does not re-shuffle the replicas.


5.2.4 Refining Specifications


In the second example (§5.2.2), we demonstrated how
developers can incrementally add detailed specifications. 
In this section, we briefly show how developers can refine 
existing specifications (an extensive description can be 
found in our short paper [14]). 

Here, we specify the HDFS log-recovery process in 
order to catch data-loss bugs in this protocol. The high- 
level check (d1) is fairly simple: “a user file is lost if it 
does not exist at the namenode.” To capture the facts, we 
wrote rule d2 which says “at any time, user files should 
exist in the union of all the three namenode files used in 
log recovery.” With these rules, we found a data-loss bug 
that accidentally deletes the metadata of user files. But, 
the error is only thrown at the end of the log recovery 
process (i.e., the rules are not detailed enough to pinpoint 
the bug). We then refined rule d2 to reflect in detail the 
four stages of the process (d3 to d5). That is, depending 
on the stage, user files are expected to be in a different 
subset of the three files. With these refined specifications, 
the data-loss bug was captured in between stage 3 and 4. 
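
The effect of the refinement can be sketched by contrasting the coarse rule d2 with the stage-aware rules d3 to d5. The sketch below is illustrative only: the function names and the example namenode state are hypothetical, and the namenode files are modeled as plain sets of user-file names.

```python
def in_namenode_coarse(ufile, nn_files):
    """d2: the file counts as present if any of the three namenode files has it."""
    return any(ufile in nn_files[name] for name in ("img", "log", "img2"))


def in_namenode_refined(ufile, nn_files, stage):
    """d3-d5: which namenode files count depends on the log-recovery stage."""
    names = ("img2",) if stage == 4 else ("img", "log")
    return any(ufile in nn_files[name] for name in names)


def lost_files(expected, nn_files, stage=None):
    """d1: expected user files that are not found in the namenode."""
    if stage is None:
        return sorted(f for f in expected if not in_namenode_coarse(f, nn_files))
    return sorted(f for f in expected if not in_namenode_refined(f, nn_files, stage))


# Hypothetical state during stage 3: the file is only left in img2.
state = {"img": set(), "log": set(), "img2": {"u1"}}
print(lost_files({"u1"}, state))           # [] (coarse rule sees no loss yet)
print(lost_files({"u1"}, state, stage=3))  # ['u1'] (refined rules flag it)
```

In this hypothetical state the coarse rule stays silent, whereas the refined rules raise the error at the stage where the file is missing from the files that should hold it.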


5.3 Summary of Advantages


Throughout the examples, we have shown the advantages 
of DESTINI: it facilitates checks, expectations, facts, 
failure events, and precise timings; specifications can be 
written from different views (e.g., global, client, datan- 
ode); different types of violations can be specified (e.g., 
availability, data-loss); different types of failures can be 
incorporated (e.g., crashes, partitioning); and specifica- 
tions can be incrementally added or refined. Overall, 
the resulting specifications are clear, concise, and precise,
which potentially attracts developers to write many
specifications to ease the complex debugging process. All
of these are feasible due to three important properties
of DESTINI: the interposition mechanism that translates
disk and network events; the use of a relational logic
language, which enables us to deduce complex states only
from events; and the inclusion of failure events from the
collaboration with FATE. Besides these advantages,
adopting DESTINI requires one major effort: developers
need to reverse-engineer raw I/O information (e.g.,
I/O buffer, stack trace) collected from the Java-based
interposition mechanism into semantically richer Datalog
events (e.g., cnpComplete). However, we hope that this
effort will also be useful for other debugging techniques 
that need detailed I/O information. 


6 Evaluation 


We evaluate FATE and DESTINI in several aspects: the 
general usability for cloud systems (§6.1), the ability to
catch multiple-failure bugs (§6.2), the efficiency of our
prioritization strategies (§6.3), the number of specifications
we have written and their reusability (§6.4), the
number of new bugs we have found and old bugs reproduced
(§6.5), and the implementation complexity (§6.6).

Since we currently only test reliability (but not per- 
formance), it is sufficient to run FATE, DESTINI, and the 
target systems as separate processes on a single machine; 
network and disk failures are emulated (manifested as 
Java I/O exceptions), and crashes are emulated with pro- 
cess crashes. Nevertheless, FATE and DESTINI can run 
on separate machines. 


6.1 Target Systems and Protocols 


We have integrated FATE and DESTINI into three cloud
systems: HDFS [33] v0.20.0 and v0.20.2+320 (the latter
was released in Feb. 2010 and is used by Cloudera and
Facebook), ZooKeeper [19] v3.2.2 (Dec. 2009), and
Cassandra [23] v0.6.1 (Apr. 2010). We have run our
framework on four HDFS workloads (log recovery, write,
append, and replication monitor), one ZooKeeper
workload (leader election), and one Cassandra workload
(key-value insert). In this paper, we only present extensive
evaluation numbers for HDFS. For Cassandra and
ZooKeeper, we only present partial results.


6.2 Multiple-Failure Bugs 


The uniqueness of our framework is the ability to explore
multiple failures systematically, and thus catch corner- 
case multiple-failure bugs. Here, we describe two out of 
five multiple-failure bugs that we found. 


6.2.1 Append Bugs 


We begin with a multiple-failure bug in the HDFS ap- 
pend protocol. Unlike write, append is more complex 
because it must atomically mutate block replicas [36]. 
HDFS developers implement append with a custom protocol;
their latest append design was written in a 19-page
document of prose specifications [22]. Append was fi- 
nally supported after being a top user demand for three 
years [36]. As a note, Google FS also supports append, 
but its authors did not share their internal design [10]. 

In the experiment setup, a block has three replicas in 
three nodes, and thus should survive two failures. On 
append, the three nodes form a pipeline. N1 starts a 
thread that streams the new bytes to N2 and then N1 ap- 
pends the bytes to its block. N2 crashes at this point, and 
N1 sends a bad ack to the client, but does not stop the 
thread. Before the client continues streaming via a new 
pipeline, all surviving nodes (N1 and N3) must agree on 
the same block offset (the syncOffset process). In this 
process, each node stops the writing thread, verifies that 
the block’s in-memory and on-disk lengths are the same,
broadcasts the offset, and picks the smallest offset. However,
N1 might not have updated the block’s in-memory
length, and thus throws an exception resulting in the new
pipeline containing only N3. Then, N3 crashes, and the 
pipeline is empty. The append fails, but worse, the block 
in N1 (still alive) becomes “trapped” (i.e., inaccessible). 
After FATE ran all the background protocols (e.g., lease 
recovery), the block is still trapped and permanently in- 
accessible. We have submitted a fix for this bug [2]. 


6.2.2 Combinations of Different Failures 


We have also found a new data-loss bug due to a se- 
quence of different failure modes, more specifically, tran- 
sient disk failure (#1), crash (#2), and disk corruption 
(#3) at the namenode. The experiment setup was that the 
namenode has three replicas of metadata files on three 
disks, and one disk is flaky (exhibits transient failures 
and corruptions). When users store new files, the na- 
menode logs them to all the disks. If a disk (e.g., Disk1) 
returns a transient write error (#1), the namenode will ex- 
clude this disk; future writes will be logged to the other 
two disks (i.e., Disk1 will contain stale data). Then, the
namenode crashes after several updates (#2). When the 
namenode reboots, it will load metadata from the disk 
that has the latest update time. Unfortunately, the file that 
carries this information is not protected by a checksum. 
Thus, if this file is corrupted (#3) such that the update 
time of Disk1 becomes more recent than the other two, 
then the namenode will load stale data, and flush the stale 
data to the other two disks, wiping out all recent updates. 
One could argue that this case is rare, but cloud-scale de- 
ployments cause rare bugs to surface; a similar case of 
corruption did occur in practice [2]. Moreover, data-loss 
bugs are serious ones [27, 29, 30]. 


6.3 Prioritization Efficiency 


When FATE was first deployed without prioritization, we 
exercised over 40,000 unique combinations of failures, 
which add up to 80 hours of testing time. Thousands
of experiments failed (probably only due to tens of bugs). 
Although 80 hours seems a reasonable testing time to un- 
earth crucial reliability bugs, this long testing time only 
covers several workloads; in reality, there are more work- 
loads to test. In addition, as developers modify their 
code, they are likely to prefer a faster turn-around time to find
new bugs from their new changes. Overall, this long testing
time is overwhelming, but it fortunately led
to a good outcome: new strategies for multiple-failure
prioritization.

To evaluate our strategies, we first focused only on two 
protocols (write and append) because we need to com- 
pare the brute-force with the prioritization results. More 
specifically, for each method, we count the number of 
combinations and the number of distinct bugs. Our hope
is that the latter is the same for brute-force and prioritization.
Table 4 shows the result of running the two
workloads with two and three failures per run, and with
a lightweight filter (crash-only failures on disk I/Os in
datanodes); without this filter, the number of brute-force
experiments is too large to debug. In short, the table
shows that our prioritization strategies reduce the total
number of experiments by an order of magnitude (the
testing time for the workloads in Table 4 is reduced from
26 hours to 2.5 hours). In addition, from our experience
no bugs are missing. Again, we cannot prove that our
approach is sound; developers could fall back to brute-force
for more confidence. Table 4 also highlights the
exponential explosion of combinations of multiple failures;
the numbers for three failures are much higher than
those for two failures (e.g., 7720 vs. 1119). So far, we
only cover up to 3 failures, and our techniques still scale
reasonably well (i.e., they still give an order of magnitude
improvement).

Workload  #F  STR  #EXP  FAIL    BUGS
Append    2   BF   1199  116     3
              PR   112   17      3
Append    3   BF   7720  **3693  *3
              PR   618   72      *3
Write     2   BF   524   120     2
              PR   49    27      2
Write     3   BF   3221  911     *2
              PR   333   82      2

Table 4: Prioritization Efficiency. The columns from left
to right are the number of injected failures per run (F),
exploration strategy (STR), combinations/experiments (EXP), failed
experiments (FAIL), and bugs found (BUGS). BF and PR stand
for brute-force and prioritization, respectively. Note that the
bug counts are only due to two and three failures and depend
on the filter (i.e., there are more bugs than shown). (*) Bugs in
three-failure experiments are the same as in two-failure ones.
(**) This high number is due to a design bug; we used triaging
to help us classify the bugs (not shown).
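
The order-of-magnitude claim can be checked directly against the #EXP column of Table 4. The short computation below uses only the experiment counts from the table; the variable names are ours.

```python
# (workload, failures per run) -> (brute-force experiments, prioritized experiments)
table4 = {
    ("Append", 2): (1199, 112),
    ("Append", 3): (7720, 618),
    ("Write", 2): (524, 49),
    ("Write", 3): (3221, 333),
}

for (workload, failures), (bf, pr) in table4.items():
    print(f"{workload} with {failures} failures: {bf / pr:.1f}x fewer experiments")

total_bf = sum(bf for bf, _ in table4.values())
total_pr = sum(pr for _, pr in table4.values())
print(total_bf, total_pr, round(total_bf / total_pr, 1))  # 12664 1112 11.4
```

Each row shrinks by roughly a factor of ten, and the overall ratio of about 11 is in line with the reported drop in testing time from 26 hours to 2.5 hours.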


6.4 Specifications 


In the last six months, we have written 74 checks on top 
of 174 rules for a total of 351 lines (65 checks for HDFS, 
2 for ZooKeeper, and 7 for Cassandra). We want to em- 
phasize that rules ratio displays how DESTINI empow- 
ers specification reuse (i.e., building more checks on top 
of existing rules). As a comparison, the ratio for our first 
check (85.2.1 in Table 2) is 16:1, but the ratio now is 3:1. 

Table 5 compares DESTINI with other related work. 
The table highlights that DESTINI allows a large number 
of checks to be written in fewer lines of code. We want to 
note that the number of specifications we have written so 
far only represents six recovery protocols; there are more 
that can be specified. As time progresses, we believe the
simplicity offered by DESTINI will open the possibility
of having hundreds of specifications along with more recovery
specification patterns.

Type  Framework        #Chks  Lines/Chk
S/I   D3S [24]         10     53
D/I   Pip [32]         44     43
S/I   WiDS [25]        15     22
D/D   P2 Monitor [34]  11     12
D/I   DESTINI          74     5

Table 5: DESTINI vs. Related Work. The table compares
DESTINI with related work. D, S, and I represent declarative,
scripting, and imperative languages, respectively. X/Y implies
specifications in X language for systems in Y language.
We divide existing work into three classes (S/I, D/D, D/I).

To show how our style of writing specifications is ap- 
plicable to other systems, we present in more detail some 
specifications we wrote for ZooKeeper and Cassandra. 


6.4.1 ZooKeeper 


We have integrated our framework to ZooKeeper [19]. 
We picked two reported bugs in the version we analyzed. 
Let’s say three nodes N1, N2, and N3, participate in a 
leader election, and id(N1) < id(N2) < id(N3). If N3 
crashes at any point in this process, the expected behavior 
is to have N1 and N2 form a 2-quorum. However, there is 
a bug that does not anticipate N3 crashing at a particular 
point, which causes N1 and N2 to continue nominating
N3 in ever-increasing rounds. As a result, the election 
process never terminates and the cluster never becomes 
available. To catch this bug, we wrote an invariant vio- 
lation “a node chooses a winner of a round without en- 
suring that the chosen leader has in itself voted in the 
round.” The other bug involves multiple failures and can 
be caught with an addition of just one check; we reuse 
rules from the first bug. So far, we have written 12 rules 
for ZooKeeper. 
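
The invariant quoted above can be illustrated with a small sketch. This is plain Python rather than DESTINI's declarative rule language, and the `Vote` and `Decision` records as well as the function `find_violations` are hypothetical names chosen only to show the shape of the check.

```python
from typing import NamedTuple

# Hypothetical event records; DESTINI itself expresses checks as declarative rules.
class Vote(NamedTuple):
    round: int
    voter: str       # node that cast the vote
    candidate: str   # node it voted for

class Decision(NamedTuple):
    round: int
    node: str        # node that declared a winner
    leader: str      # the winner it chose

def find_violations(votes, decisions):
    """Return decisions whose chosen leader never voted in that round."""
    voted = {(v.round, v.voter) for v in votes}
    return [d for d in decisions if (d.round, d.leader) not in voted]

# N3 votes in round 1 but crashes before voting in round 2,
# yet N1 still picks N3 as the winner of round 2.
votes = [Vote(1, "N1", "N3"), Vote(1, "N2", "N3"), Vote(1, "N3", "N3"),
         Vote(2, "N1", "N3"), Vote(2, "N2", "N3")]
decisions = [Decision(1, "N1", "N3"), Decision(2, "N1", "N3")]
violations = find_violations(votes, decisions)
```

Only the round-2 decision is flagged, because N3 has no vote recorded for that round.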


6.4.2 Cassandra 


We have also done the same for Cassandra [23], and 
picked three reported bugs in the version we analyzed. In 
Cassandra, the key-value insert protocol allows users to 
specify a consistency level such as one, quorum, or all, 
which ensures that the client waits until the key-value 
has been flushed on at least one, N/2 + 1, or all N nodes 
respectively. These are simple specifications, but again, 
due to complex implementation, bugs exist and break the 
rules. For example, at level all, Cassandra could
incorrectly return a success even when only one replica has
been completed. FATE is able to reproduce the failure 
scenarios and DESTINI is equipped with 7 checks (in 12 
rules) to catch consistency-level related bugs. 
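
The consistency levels described above reduce to a simple acknowledgement threshold. The following Python sketch is only an illustration of that arithmetic; the function names `required_acks` and `violates_consistency` are assumptions and do not correspond to Cassandra's or DESTINI's actual code.

```python
def required_acks(level: str, n: int) -> int:
    """Number of replicas that must have flushed the key-value (n = replica count)."""
    if level == "one":
        return 1
    if level == "quorum":
        return n // 2 + 1
    if level == "all":
        return n
    raise ValueError(f"unknown consistency level: {level}")

def violates_consistency(level: str, n: int, acks_at_success: int) -> bool:
    """True if success was reported with fewer acknowledgements than the level demands."""
    return acks_at_success < required_acks(level, n)

# The bug from the text: level "all" with 3 replicas, success after a single replica.
bug_detected = violates_consistency("all", 3, 1)
```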


6.5 New Bugs and Old Bugs Reproduced 


We have tested HDFS for over eight months and sub- 
mitted 16 new bugs, out of which 7 uncovered design 
bugs (i.e., require protocol modifications) and 9 uncov- 
ered implementation bugs. All have been confirmed by 
the developers. For Cassandra and ZooKeeper, we ob- 
served some failed experiments, but since we do not have 
the chance to debug all of them, we have no new bugs to 
report. 

To further show the power of our framework, we ad- 
dress two challenges: Can FATE reproduce all the failure 
scenarios of old bugs? Can DESTINI facilitate specifica- 
tions that catch the bugs? Before proposing our frame- 
work for catching unknown bugs, we wanted to feel con- 
fident that it is expressive enough to capture known bugs. 
We went through the 91 HDFS recovery issues (§2.2)
and selected 74 that relate to our target workloads (§6.1).
FATE is able to reproduce all of them; as a proof, we 
have created 22 filters (155 lines in Java) to reproduce all 
the scenarios. Furthermore, we have written checks that 
could catch 46 old bugs; since some of the old bugs have 
been fixed in the version we analyzed, we introduced ar- 
tificial bugs to test our specifications. For ZooKeeper and 
Cassandra, we have reproduced a total of five bugs. 


6.6 FATE and DESTINI Complexity 


FATE comprises generic (workload driver, failure server, 
failure surface) and domain-specific parts (workload 
driver, I/O IDs). The generic part is written in 3166 lines 
in Java. The domain-specific parts are 422, 253, and 
357 lines for HDFS, ZooKeeper and Cassandra respec- 
tively; the part for HDFS is bigger because HDFS was 
our first target. DESTINI’s implementation cost comes 
from the translation mechanism (§5.1). The generic part
is 506 lines. The domain-specific parts are 732 (more 
complete), 23, and 35 lines for HDFS, ZooKeeper, and 
Cassandra respectively. FATE and DESTINI interpose the 
target systems with AspectJ (no modification to the code 
base). However, it was necessary to slightly modify the 
systems (less than 100 lines) for two purposes: defer- 
ring background tasks while the workload is running and 
sending stable-state commands. 


7 Conclusion and Future Work 


The scale of cloud systems — in terms of both infrastruc- 
ture and workload — makes failure handling an urgent 
challenge for system developers. To assist developers in 
addressing this challenge, we have presented FATE and 
DESTINI as a new framework for cloud recovery testing. 
We believe that developers need both FATE and DESTINI 
as a unified framework: recovery specifications require
a failure service to exercise them, and a failure service
requires specifications of expected failure handling. 

Beyond finding problems in existing systems, we be- 
lieve such testing is also useful in helping to generate 
new ideas on how to build robust, recoverable systems. 
For example, one new approach we are currently investigating
is the increased use of pessimism to avoid problems
during recovery. For instance, HDFS lease recovery
would have been more robust had it not trusted aspects
of the append protocol to function correctly (§6.2).
Many other examples exist; only through further care- 
ful testing and analysis will the next generation of cloud 
systems meet their demands.


Abstract


Network emulation brings together the strengths of net- 
work simulation (scalability, modeling flexibility) and 
real-world software prototypes (realistic analysis). Un- 
fortunately network emulation fails if the simulation is 
not real-time capable, e.g., due to large scenarios or com- 
plex models. So far, this problem has generally been ad- 
dressed by providing massive computing power to the 
simulation, which is often too costly or even infeasible. 
In this paper we present SliceTime, our platform for 
scalable and accurate network emulation. It enables the 
precise evaluation of arbitrary networking software with 
event-based simulations of any complexity by relieving 
the network simulation from its real-time constraint. We 
achieve this goal by transparently matching the execu- 
tion speed of virtual machines hosting the software pro- 
totypes with the network simulation. We demonstrate the 
applicability of SliceTime in a large-scale WAN scenario 
with 15000 simulated nodes and show how our frame- 
work eases the analysis of software for 802.11 networks. 


1 Introduction 


We are still in need of adequate tools for performance in- 
vestigations as well as for testing of real-world network 
protocol implementations and large-scale distributed sys- 
tems. In this regard, the first major requirement is scal- 
ability. For example, in order to facilitate the analysis 
of contemporary P2P applications, such a tool needs to 
scale up to potentially thousands of nodes. Second, we 
need experimentation platforms that isolate the protocol 
implementation and its communication from real-world 
communication networks. Such strong isolation is important
for the investigation of malware to prevent a potential
outbreak. Isolated evaluation environments are also
well suited for the analysis of software for wireless net- 
works as unwanted disturbances on the wireless channel 
can be avoided. 


Discrete event-based network simulation is a well- 
established methodology for the evaluation of network 
protocols. Network simulators, such as ns-3 [27] or OM- 
NeT++ [37], facilitate the flexible analysis of arbitrary 
network protocols. Due to their abstract modeling ap- 
proach, network simulations scale well to network sizes 
of up to many thousand nodes. 

However, abstract simulation models focus only on the 
most relevant aspects of the communicating nodes. They 
disregard the system context of a network protocol and 
its run-time environment, like the influence of an operat- 
ing system regarding timing, concurrent processes, and 
resource constraints. This fundamental concept of ab- 
straction limits the applicability of network simulations 
to network performance metrics. For instance, investiga- 
tions of run-time performance, resource usage, and the 
interoperability with other protocol implementations are 
difficult to obtain by solely using simulations. The strict 
event-based notion of network simulators also makes it 
generally impossible to execute arbitrary networking applications
inside the simulation environment. These issues
complicate performance studies that are very important
for the applicability of communication systems.

Performance evaluations under real conditions are 
mostly carried out within network testbeds of proto- 
type implementations. However, setting up large-scale 
testbeds is expensive and their maintenance is often 
cumbersome. Shared testbeds such as PlanetLab [7],
Emulab [42] and MoteLab [41] partially fill this gap. Yet their
flexibility is limited due to a lack of topology controlla- 
bility, shared testbed usage or insufficient scalability. 

Network emulation as introduced by Fall [10] brings
together the flexibility of discrete event-based network
simulation with the precision of evaluation using real-world
testbeds. An event-based simulation modeling a
computer network of choice is connected to a real-world
software prototype. Traffic from the prototype is fed to
the simulation and vice versa. This way, the software prototype
can be evaluated in any network that can be modeled
by the simulator. One fundamental issue of network
emulation is the differing time representations of event-based
simulations and software prototypes. Event-based
simulations consist of a series of discrete events with an 
associated event execution time. Once an event has been 
processed, the simulation time is advanced to the execu- 
tion time of the next event. By contrast, software pro- 
totypes observe a continuously progressing wall-clock 
time. 
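
To make the contrast concrete, the following minimal Python sketch shows how an event-based simulation advances its clock. It is a generic illustration and not code from ns-3 or SliceTime; the function name `run_events` and the event names are made up for the example.

```python
import heapq

def run_events(events):
    """Process (time, name) events in time order and record the clock at each event."""
    queue = list(events)
    heapq.heapify(queue)  # event queue ordered by execution time
    sim_time = 0.0
    trace = []
    while queue:
        event_time, name = heapq.heappop(queue)
        # The simulation clock jumps directly to the next event;
        # no simulation time passes between events.
        sim_time = event_time
        trace.append((sim_time, name))
    return trace

trace = run_events([(5.0, "rx"), (0.5, "tx"), (2.0, "timeout")])
```

How long the loop takes in wall-clock time is unrelated to the simulated timestamps, which is exactly the mismatch a software prototype running against wall-clock time would observe.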

Existing implementations of network emulation pin 
the execution of simulation events to the correspond- 
ing wall-clock time. Unfortunately, this approach is only 
useful if the simulation can be executed in real time. Otherwise,
a simulation without sufficient computational resources
will lag behind and thus be unable to deliver
packets on time. Such simulator overload may result from
complex network simulations, for example due to a high 
number of simulated nodes or models of high compu- 
tational complexity. Simulator overload has to be pre- 
vented because deficient protocol behavior such as con- 
nection time-outs, unwanted retransmissions, or the as- 
sumption of network congestion would be the direct con- 
sequence. Moreover, even slight simulator overload may 
invalidate performance evaluations because the network 
cannot be simulated within the required timing bounds. 

Speeding up the simulation to make it real-time capable
is the first obvious option to deal with simulator
overload. This speed-up can be achieved by supplying
the simulation machine with sufficient computational
resources in the form of hardware or by parallelizing
the network simulation. However, we argue that this ap- 
proach lacks generality because parallel processing can 
only scale to the degree of possible parallelism within the 
simulation. In addition, the amount of hardware needed 
for real-time execution rapidly grows with the simulation 
complexity, making this option inaccessible for many re- 
search institutes and individuals. 

So far, network emulation has merely been an arms 
race between the complexity of the simulation model 
and the computational power of the simulation hardware. 
Hence, traditional approaches result in variable hard- 
ware requirements and fixed execution time (real time). 
By contrast, we aim at reducing the cost of precise net- 
work emulation by designing a system with fixed hard- 
ware demands but with variable execution time (real time 
or slower). More specifically, the main contributions of 
this paper are the following: 


1. We thoroughly elaborate the design of Slice- 
Time and its underlying concept of synchronized 
network emulation [39, 40] (Section 2). It eliminates
the need for the network simulation to execute
in real time. This enables network emulation
scenarios using simulations of any complexity. We
achieve this goal by synchronizing the software prototypes
with the network simulation. Using virtualization,
we decouple the software prototypes’ perceived
progression of time from wall-clock time.


2. Our implementation of SliceTime (cf. Section 3) for 
x86 systems enables the synchronized execution of
Xen-based [3] virtualized prototypes and ns-3 sim- 
ulations with an accuracy down to 0.01 ms. 


3. We show that SliceTime delivers a high degree of accuracy
and transparency, both regarding timing and
perceived network bandwidth (Section 4). We further
demonstrate in our evaluation that SliceTime
is run-time efficient and that the synchronization
overhead stays below 10% at an accuracy of 0.5 ms.


4. We illustrate how SliceTime simplifies testing and 
performance evaluations of WiFi software for Linux 
by remodeling a large-scale AODV field test en- 
tirely in software. We further demonstrate the scal- 
ability of SliceTime by applying it to a large-scale 
wide-area network (WAN) scenario with 15 000 
nodes (Section 5). 


In Section 6 we discuss the related work before conclud- 
ing this paper in Section 7. 


2 SliceTime 


We now present the design of SliceTime. A SliceTime
setup incorporates three main components
(cf. Figure 1): The central synchronization component 
(synchronizer), at least one virtual machine (VM) carry- 
ing a software prototype of choice, and an event-based 
network simulation. The synchronizer controls the 
execution of the network simulation and the software 
prototypes. In order to carry out such a synchronization, 
the synchronizer must interrupt the execution of the 
prototype or the simulation at times to achieve precise 
clock alignment. To enable this suspension, the software 
prototypes are hosted inside virtual machines as a means
of control.


2.1 Synchronization Component 


The synchronization component centrally coordinates a 
SliceTime setup. Its task is to manage the synchronous 
execution of the network simulation and the attached 
virtual machines. It implements a synchronization algorithm
to prevent potential time drifts and clock misalignments
between the virtual machines and the network simulation.
As candidates for the synchronization algorithm, we
consider solutions known from the research domain of
parallel discrete event-based simulation (PDES) [11]. In
this regard, two classes of synchronization are distinguished:
optimistic synchronization schemes and conservative
synchronization schemes.


Figure 1: Conceptual Overview of SliceTime: By relying
on entirely virtualized prototypes, we are able to syn- 
chronize the execution speed of the simulation and the 
prototypes. The simulation is relieved from its real-time 
constraint, enabling large-scale network emulation sce- 
narios on off-the-shelf hardware. 


Optimistic schemes, most notably Time Warp [18], ex- 
ecute the parallel simulation in a speculative fashion. In 
case of synchronization errors, roll-backs are used to re- 
store a consistent and error-free global state. For the abil- 
ity to roll back to a consistent state, optimistic schemes 
often incorporate regular snapshots of the synchronized 
peers. As the state of a virtual machine includes the 
memory allocated for the running operating system in- 
stance, check-pointing is costly at the desired level of 
synchronization granularity. Conservative synchroniza- 
tion schemes, by contrast, guarantee a parallel execu- 
tion without synchronization errors, and hence, do not 
require a roll-back mechanism. However, most conser- 
vative schemes, such as the null-message algorithm by 
Chandy and Misra [6], require knowledge about the fu- 
ture behavior (look-ahead) of a system. While the look- 
ahead in event-based simulations can be determined by 
inspecting their event queue, predicting the future run- 
time behavior of a virtual machine is generally not possi- 
ble. In effect, this limits the choice of a synchronization 
algorithm for SliceTime to a scheme which neither makes 
assumptions about the future behavior nor requires regu- 
lar snapshots to be taken. 


SliceTime uses a scheme similar to conservative time 
windows (CTW) [23] for synchronizing network simu- 
lations and VMs. In the following, we refer to this al- 
gorithm as barrier synchronization. Figure 2 shows the 
synchronization of two components, one VM and one 
network simulation, via the barrier synchronization al- 
gorithm. It allows every synchronized peer to run for 
a certain amount of time, the so-called time slice, af- 
ter which it blocks until all other peers reach the bar- 
rier. At this point, the barrier is lifted, and a new future 
barrier is set up to which the execution of the synchro- 
nized components continues again. As the execution of 
both the network simulation and the virtual machine is 
always bounded by a barrier, the time drift between them 
is limited to the size of one time slice at all times. Consequently,
the synchronization accuracy is directly given
by the size of the time slice.


Figure 2: Different steps of the barrier algorithm used
for the synchronization of one VM and one event-based
simulation. The execution of the simulation and the VM
is blocked until both have finished the time slice.
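
The barrier scheme described above can be sketched in a few lines of Python. This is a sequential toy model under stated assumptions, not the SliceTime implementation: the `Peer` class, its `step` parameter, and the use of integer microseconds are all hypothetical choices made to keep the example exact and short.

```python
class Peer:
    """A synchronized component (VM or simulation) with its own logical clock."""
    def __init__(self, name, step_us):
        self.name = name
        self.step_us = step_us  # how far the peer advances per unit of work
        self.clock_us = 0

    def run_until(self, barrier_us):
        # Advance in peer-specific steps, but never past the barrier.
        while self.clock_us < barrier_us:
            self.clock_us = min(self.clock_us + self.step_us, barrier_us)

def synchronize(peers, slice_us, num_slices):
    """Run all peers slice by slice and return the largest drift observed."""
    barrier_us = 0
    max_drift = 0
    for _ in range(num_slices):
        barrier_us += slice_us  # set up the next barrier
        for peer in peers:
            peer.run_until(barrier_us)
            clocks = [p.clock_us for p in peers]
            max_drift = max(max_drift, max(clocks) - min(clocks))
        # Every peer has reached the barrier, so it can be lifted.
    return max_drift

peers = [Peer("vm", 30), Peer("simulation", 70)]
drift = synchronize(peers, slice_us=100, num_slices=5)
```

In this model the observed drift never exceeds one time slice, which mirrors the bound stated in the text.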


2.2 Virtual Machines 


The virtual machines encapsulate the software prototype 
to be integrated with the network simulation. We con- 
sider a prototype to be an instance of any operating sys- 
tem (OS) that carries arbitrary network protocol imple- 
mentations or applications. The virtualization of OS in- 
stances hosting software prototypes disassociates their 
execution from the system hardware and hence allows 
for obtaining full control over their run-time behavior. 

Therefore, the execution of the prototype can be sus- 
pended until all synchronized components have reached 
the end of the time slice. This suspension avoids sim- 
ulator overload by allowing the network simulator to 
run while the virtual machines are waiting. However, 
this suspension is typically detectable by the VMs, be- 
cause they are relayed information from hardware time 
sources. Under normal circumstances, this behavior is 
desired to keep the clock synchronized to wall-clock time 
and to make sure that timers expire at the right point of 
time. However, since we suspend the VMs in order to 
synchronize their time against each other and the sim- 
ulation, we must avoid this behavior. Having full con- 
trol over the VM’s perception of time we instead provide 
them with a consistent and continuous logical time. This 
leaves us with the possibility of transparently suspending 
the execution of a prototype without the implementation 
noticing the actual gap in real-world time. 
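
One way to picture such a logical clock is to subtract all time spent suspended from the wall-clock reading. The sketch below is an assumption-laden illustration in Python, not the Xen modification itself; the class `LogicalClock` and its method names are hypothetical.

```python
class LogicalClock:
    """Hides suspension periods from the guest's view of time (microseconds)."""
    def __init__(self):
        self.hidden_us = 0        # total wall-clock time spent suspended
        self.suspended_at = None  # wall-clock time of the current suspension

    def suspend(self, wall_us):
        self.suspended_at = wall_us

    def resume(self, wall_us):
        self.hidden_us += wall_us - self.suspended_at
        self.suspended_at = None

    def guest_time(self, wall_us):
        if self.suspended_at is not None:
            wall_us = self.suspended_at  # time stands still while suspended
        return wall_us - self.hidden_us

clock = LogicalClock()
clock.suspend(100)   # VM stopped at wall-clock time 100
clock.resume(400)    # VM restarted at wall-clock time 400
guest_now = clock.guest_time(450)
```

The guest sees a continuous timeline: the 300 microseconds of suspension are simply absent from its clock.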


2.3 Event-based Network Simulation 


The key task of the network simulation is to model the 
network that connects the virtual machines. Following 
the terminology of Fall [10], we distinguish between an 
opaque and a protocol-aware network emulation mode.
In the case of opaque network emulation, the simulator
merely influences the propagation of network traffic,
for example by delaying or duplicating packets. This approach
is prevalent in many available tools [1, 2, 5, 30].
By contrast, we focus on protocol-aware network emula- 
tion. In this case, the network simulation implements the 
communication protocols that are used by the VM proto- 
types. This enables the provision of simulated hosts that 
interact with the VMs. 

For integrating an event-driven network simulation 
with a SliceTime setup, it needs to be interfaced to both 
the synchronization component (timing control interface)
and the virtual machines (data communication interface).
The timing control interface is tightly coupled
with the event scheduler of the simulator. Recall that 
an event-based network simulator maintains a list of all 
scheduled events ordered by the time of execution. Typ- 
ically, the simulation simply processes these events se- 
quentially until the event queue is empty. In SliceTime, a 
custom scheduler checks if the next event’s time of exe- 
cution resides in the current time slice. If this is the case, 
the event is executed. If not, the event scheduler notifies 
the synchronization component through the timing con- 
trol interface. The next event is processed after the barrier 
has been shifted past the execution time of the event. 
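
The behavior of such a slice-aware scheduler can be sketched as follows. This Python model is not the ns-3 scheduler used by SliceTime; the class `SlicedScheduler`, the return value standing in for the notification, and the choice to treat the barrier as an exclusive upper bound are assumptions made for illustration.

```python
import heapq

class SlicedScheduler:
    """Executes queued (time_us, name) events only up to the current barrier."""
    def __init__(self, events):
        self.queue = list(events)
        heapq.heapify(self.queue)
        self.barrier_us = 0
        self.executed = []

    def run_slice(self, new_barrier_us):
        """Process every event inside the slice, then report completion."""
        self.barrier_us = new_barrier_us
        # Assumption: an event exactly at the barrier belongs to the next slice.
        while self.queue and self.queue[0][0] < self.barrier_us:
            self.executed.append(heapq.heappop(self.queue))
        return "slice_done"  # stands in for notifying the synchronizer

sched = SlicedScheduler([(50, "a"), (150, "b"), (120, "c")])
first = sched.run_slice(100)
after_first = list(sched.executed)
second = sched.run_slice(200)
```

Events "b" and "c" stay queued during the first slice and are only executed once the barrier has moved past their execution times.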

The data communication interface connects the simu- 
lation and the virtual machines on the protocol level. The 
functional integration between the VMs and the network 
simulation takes place at gateway nodes inside the sim- 
ulation, a concept adapted from [10]. These nodes can 
be viewed as a simulation’s internal representation of the 
virtual machine they are connected to. Their real func- 
tionality is inside the virtual machine and their purpose is 
to have a communication endpoint inside the simulation 
at which the packet exchange with the virtual machines 
takes place. 

For performance reasons, many network simulation 
frameworks use custom data structures to model a net- 
work packet, and encapsulation is mostly expressed us- 
ing pointers to secondary message structures. In contrast, 
real systems exchange binary information, for example, 
Ethernet frames. When a binary packet generated by a 
VM arrives at the simulator, the gateway node takes care 
of converting it into a network simulation message. Sim- 
ilarly, an outgoing packet must be serialized in an ade- 
quate fashion before it leaves the simulation. 
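
As a rough illustration of this conversion, the Python sketch below turns a raw Ethernet frame into a structured message and back. It only handles the 14-byte Ethernet II header; the `SimPacket` class and the functions `from_wire` and `to_wire` are hypothetical and much simpler than the message structures of a real simulator.

```python
import struct
from dataclasses import dataclass

# Ethernet II header: destination MAC, source MAC, EtherType (network byte order).
ETH_HEADER = struct.Struct("!6s6sH")

@dataclass
class SimPacket:
    """Hypothetical simulator-side representation of a frame."""
    dst: bytes
    src: bytes
    ethertype: int
    payload: bytes

def from_wire(frame: bytes) -> SimPacket:
    """Convert a binary frame received from a VM into a simulation message."""
    dst, src, ethertype = ETH_HEADER.unpack_from(frame)
    return SimPacket(dst, src, ethertype, frame[ETH_HEADER.size:])

def to_wire(pkt: SimPacket) -> bytes:
    """Serialize a simulation message before it leaves the simulation."""
    return ETH_HEADER.pack(pkt.dst, pkt.src, pkt.ethertype) + pkt.payload

frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"hello"
pkt = from_wire(frame)
```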


3 Implementation 


We now discuss our implementation of SliceTime com- 
prising three types of main components (see Figure 3): 
a synchronization component (synchronizer), the vir- 
tual machine infrastructure and a network simulation. 
Two different flows of communication are present in 
our system. The synchronizer delivers the synchronization
information over the timing control interface using a
lightweight signaling protocol. A tunnel that carries Ethernet
frames from the VMs to the simulation and vice
versa serves as our data communication interface. The 
VM implementation is based on the Xen hypervisor and 
executes multiple instances of guest domains which host 
an operating system and a prototype implementation. 
Our implementation uses the ns-3 network simulator to 
model the network to which the VMs are connected. For 
this purpose we extend the existing emulation framework 
of ns-3 for synchronized network emulation. 


3.1 Synchronization component 


The synchronizer is implemented as a user-space appli- 
cation. Its main purpose is to implement the timing con- 
trol interface. The synchronization component assigns 
discrete slices of run-time to the simulation and to the 
virtual machines. In order to distribute the synchronized 
components across different physical systems, the syn- 
chronization signaling is implemented on top of UDP. 
In addition to the synchronization coordination, the syn- 
chronizer also manages the set of synchronized compo- 
nents. In particular, it allows peers to join and to leave 
the synchronization during run-time. This allows certain
tasks (e.g., booting and configuring a virtual machine
and the hosted software prototype) to run outside the
synchronized setup.


3.1.1 Timing Control Interface 


One challenge is the large number of messages that need
to be exchanged between the synchronized VMs and the
simulation. For example, if the time slices are configured
to a static logical duration of 0.1 ms, the synchronization
component needs to issue 10000 time slices to
all attached VMs and the simulation for one second of
logical time. A similarly large number of messages
is sent by the synchronized peers, which signal the completion
of every time slice individually to the synchronizer.
Therefore, in order to maintain a good run-time
efficiency, it is vital to limit the delays and the overhead 
caused by synchronization signaling and message pars- 
ing. For these reasons, we created a lightweight synchro- 
nization protocol based on UDP for SliceTime. It pro- 
vides all communication primitives of the timing con- 
trol interface. The assignment of time slices to all syn- 
chronized peers is carried out using UDP broadcasts, 
while the remaining communication, such as signaling 
time slice completion, takes place using unicast data- 
grams. Moreover, the UDP packets have a fixed structure 
and only carry the synchronization information in binary 
form. This is necessary to keep both the packet size and 
the parsing complexity at a very low level. 
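
To give a feel for a fixed-structure binary message and the message rate involved, consider the Python sketch below. The wire layout is purely hypothetical (the text does not specify SliceTime's actual packet format), as are the constants `MSG_RUN` and `MSG_DONE` and the helper names.

```python
import struct

# Hypothetical layout, NOT SliceTime's real format:
# 1-byte message type, 2-byte peer id, 4-byte slice length in microseconds.
SYNC_MSG = struct.Struct("!BHI")
MSG_RUN, MSG_DONE = 1, 2

def encode(msg_type: int, peer_id: int, slice_us: int) -> bytes:
    return SYNC_MSG.pack(msg_type, peer_id, slice_us)

def decode(data: bytes):
    return SYNC_MSG.unpack(data)

def slices_per_second(slice_us: int) -> int:
    """Number of time slices needed to cover one second of logical time."""
    return 1_000_000 // slice_us

msg = encode(MSG_RUN, 0, 100)        # run for a 0.1 ms slice
rate = slices_per_second(100)        # slices issued per logical second
```

With a fixed layout, decoding is a single unpack call with no text parsing, which is the property the paragraph above argues for.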
Figure 3: SliceTime consists of a central synchronization unit, at least one network simulation based on ns-3 and one
or more Xen hypervisor systems serving as the VM infrastructure. 


3.2 Virtual Machines 


We use the Xen hypervisor and its scheduling mecha- 
nisms as the basis of our work. Xen is a virtual machine 
monitor for x86 CPUs. The hypervisor itself takes care 
of memory management and scheduling, while hardware 
access is delegated to a special privileged virtual machine
(or domain, in Xen’s parlance) running a modified Linux 
kernel. As the first domain that is started during booting, 
it is often referred to as dom0. 

Xen supports two modes of operation: para- 
virtualization mode (PVM) and hardware virtualization 
mode (HVM). SliceTime uses Xen HVM domains for 
virtualizing operating systems and software prototypes. 
In contrast to para-virtualization, HVM Xen domains do 
not require the kernel of the guest system to be modified 
for virtualization. This allows any x86 OS, including
closed-source ones such as Windows, to be incorporated into a
SliceTime set-up.

We now describe the main parts of our work: a) the 
data communication interface to couple virtual machines 
and the simulator, b) the synchronization client that in- 
terfaces with the synchronization component, and c) the 
changes necessary to transparently interrupt and restart 
the VM to align its execution speed to the run-time per- 
formance of the simulator. 


3.2.1 Data Communication Interface 


For the network data communication between virtual ma- 
chines and simulation, it is first important to note that ev- 
ery virtual machine can have one or several virtual net- 
work interfaces that look like real interfaces to the virtual 
machine, and can be accessed inside dom0. We bridge 
the virtual interface in the dom0 with a tap device and 
redirect all Ethernet traffic from the VM to the computer 
running the simulation. Conversely, all Ethernet frames 
received from the simulation over the tunnel are fed back 
to the virtual machine in the same way. 


3.2.2 The Xen Synchronization Client


To keep the VM in sync with the communication, the 
synchronization component communicates with a syn- 
chronization client on the machine running Xen. Be- 
cause of the potentially high number of synchronization 
messages (depending on the size of the chosen time
slices), the performance of the synchronization clients is 
crucial to the overall performance of the system. For this 
reason, the client was implemented as a Linux kernel 
module. This is especially beneficial because Xen del- 
egates hardware access to the privileged domain dom0. 
Therefore, the implementation in kernel space of the 
privileged domain saves half of the otherwise necessary 
context switches for communication and our VM imple- 
mentation. Since context switches (between user space, 
kernel space, and, in addition here, hypervisor context) 
are expensive operations, halving the number of them has 
a very noticeable impact on the overall performance. 

The client communicates with the synchronization 
component via UDP datagrams as described in Section 
3.1.1. It then instructs Xen’s scheduler via a hypercall 
(the domain-hypervisor equivalent of a user-kernel sys- 
tem call) to start the synchronized domain for the amount 
of time specified by the synchronizer. The client also reg- 
isters an interrupt handler to a virtual interrupt, that is, 
an interrupt that can be raised by the hypervisor. When 
the synchronized domain has finished its assigned time 
slice, the interrupt is raised, the client’s handler is exe- 
cuted, and it can inform the synchronizer via UDP. This 
interrupt-based signaling ensures a prompt processing by 
the involved entities. 


3.2.3 Xen Extensions 


The other tasks necessary for our synchronization 
scheme are carried out within the Xen hypervisor. To 
reach the goals we set forth, it is necessary to be able 
to precisely start and stop the VM’s operation according
to the time slices assigned by the synchronization
component. However, since operating systems have ways
to detect the passing of time via hardware support (real 
time clocks, hardware timers etc.), simply stopping and 
restarting the VM will not lead to the desired effect. It 
will still be aware of the passing of time while it was 
stopped, and therefore, operations that depend on time 
information (e.g., time-outs of TCP connections) will 
still occur at the wrong times. Therefore, to reach trans- 
parency, it is not only necessary to be able to start and 
stop VMs accurately, but also to provide a consistent and 
steady perception of time for the VM. Hence, all time 
sources of the VM must be controlled and adjusted in 
the hypervisor. 

To reach the first goal, that is, starting and stopping 
VMs and running them for precise amounts of time, we
extended the sEDF (simple earliest deadline first) sched- 
uler that is part of the Xen hypervisor. Schedulers in Xen 
schedule VMs in a similar fashion to an operating sys- 
tem’s scheduler. In particular, the sEDF maintains peri- 
odical deadlines for each domain, and an amount of time 
the domain has to be executed up to that deadline. To 
manage the domains, it utilizes several queues. A run 
queue contains all domains that still need to run some 
time until their next deadline; once this constraint is ful- 
filled, a domain migrates to the wait queue until it reaches 
its deadline, at which point it rejoins the run queue with 
a new deadline and required execution time. 

However, the synchronized domains have to be kept 
outside this periodical scheme, because these are only 
scheduled when the synchronization component issues 
the instruction to do so. Therefore, we introduced another 
queue, the sync queue, which works as a replacement of 
the wait queue for synchronized domains. These domains 
stay on that queue until they are to be scheduled again, 
then migrate to the run queue, and back to the sync queue 
afterwards. This way, synchronized domains can be kept
outside the normal scheduling of non-synchronized domains.
Hence, non-synchronized domains may coexist
with synchronized domains on the same physical machine.
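
The queue transitions described above can be sketched as follows. This is a simplified Python model rather than Xen code; the names `Domain`, `SyncScheduler`, `release`, `account` and `tick` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    synchronized: bool = False
    remaining: int = 0   # execution time still owed (microseconds)
    deadline: int = 0    # absolute deadline; unused for synchronized domains
    period: int = 0      # sEDF period of a periodic domain
    budget: int = 0      # execution time granted per period


class SyncScheduler:
    """Toy model of the run, wait and sync queues (not the Xen implementation)."""

    def __init__(self):
        self.run, self.wait, self.sync = [], [], []

    def add(self, dom, now=0):
        if dom.synchronized:
            self.sync.append(dom)   # parked until the synchronizer releases it
        else:
            dom.deadline, dom.remaining = now + dom.period, dom.budget
            self.run.append(dom)

    def release(self, dom, slice_len):
        """The synchronizer grants one time slice to a synchronized domain."""
        self.sync.remove(dom)
        dom.remaining = slice_len
        self.run.append(dom)

    def account(self, dom, ran):
        """Charge `ran` microseconds to a domain that is on the run queue."""
        dom.remaining -= ran
        if dom.remaining <= 0:
            self.run.remove(dom)
            (self.sync if dom.synchronized else self.wait).append(dom)

    def tick(self, now):
        """Move periodic domains whose deadline has passed back to the run queue."""
        for dom in [d for d in self.wait if d.deadline <= now]:
            self.wait.remove(dom)
            dom.deadline += dom.period
            dom.remaining = dom.budget
            self.run.append(dom)
```

In this model a synchronized domain never enters the wait queue, so the periodic deadline handling in `tick` leaves it untouched.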

One issue that originally impaired precise timing in 
the low microsecond range was rooted in the original 
implementation of the Xen scheduling subsystem. The 
Xen scheduler assumes itself to run instantly, not con- 
suming any time. Therefore, a time stamp at the begin- 
ning of the execution of the scheduling loop was taken. 
This was considered the point of time the next scheduled 
domain was started. Therefore, time spent in the sched- 
uler was attributed to the domain chosen for execution. 
We changed this to take a time-stamp before the con- 
text switch to the domain. This causes the time spent in 
the scheduler not to be attributed to any domain, there- 
fore increasing accuracy. In addition, our modified sEDF 
scheduler records overall assigned run-time and adjusts 
itself to the small (generally sub-microsecond) inaccuracies
that are inherent to Xen’s timer management and
lead to slightly early or late returns from the scheduled 
VM to the hypervisor. 

To reach the second goal, that is, masking the passing
of time from VMs while they are stopped, several changes
had to be applied to the Xen hypervisor. In fact, one of
the reasons we decided to use a virtualization approach 
for SliceTime was the specific characteristic of decou- 
pling a virtualized operating system from the hardware 
it, under normal circumstances, directly interfaces with. 
This way, we can modify the information that the OS 
receives from the hardware time sources, and therefore 
reach our goal of masking the passing of time. 

To facilitate this masking, we have to amend the two 
main sources of time keeping: time counters and in- 
terrupt timers. Within the modified scheduler, we take 
timestamps whenever a domain is scheduled and un- 
scheduled. This allows us to keep track of the total 
amount of time the domain was not running since the 
start of the synchronization. This delta value is subtracted 
from the counter that domains use to measure the pass- 
ing of time; in the case of Xen and HVM domains, this 
measurement is chiefly based on the time stamp counter 
(TSC), a CPU register whose value increases at regular
intervals. Modern CPUs with hardware virtualization
support allow the virtualization of the TSC, which lets
us change its value as perceived by the VM by subtracting
the delta value. This way, the TSC progresses in a
linear fashion, even if the domain is unscheduled for
extended amounts of time.
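
A minimal sketch of this bookkeeping, assuming the hypervisor sees a monotonically increasing host counter: the class `VirtualTsc` and its method names are hypothetical and only model the delta subtraction described above, not the hardware TSC offsetting mechanism itself.

```python
class VirtualTsc:
    """Guest-visible counter = host counter minus the time spent descheduled."""

    def __init__(self):
        self.paused_total = 0        # accumulated host ticks while not running
        self.descheduled_at = None   # host counter at the last deschedule

    def deschedule(self, host_tsc):
        self.descheduled_at = host_tsc

    def schedule(self, host_tsc):
        if self.descheduled_at is not None:
            self.paused_total += host_tsc - self.descheduled_at
            self.descheduled_at = None

    def read(self, host_tsc):
        """Value the domain observes while it is running."""
        return host_tsc - self.paused_total
```

A domain that is stopped at host tick 100 and resumed at tick 400 reads 100 again on resumption, so the pause is invisible to it.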

Timers, the second source of time keeping, must also 
appear to act as if the domain was running continuously. 
To facilitate this, the same scheduler timestamps are used 
to keep track of the time the domain was last unscheduled.
Whenever a domain is unscheduled, all timers that
belong to it are stopped, in particular those backing
the virtualized hardware timers such as the RTC
and APIC timers. When the domain is rescheduled,
the time delta since the last unscheduling is added to the 
expiry time of all timers, after which they are reactivated. 
This way, timers expire at the correct point of virtual 
time, upholding the notion of linearly progressing time. 
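
The expiry adjustment can be illustrated with a short sketch. The function name `shift_timers` and the use of plain integer timestamps are assumptions made for this example.

```python
def shift_timers(expiries, descheduled_at, rescheduled_at):
    """Add the time a domain was not running to every pending timer expiry."""
    delta = rescheduled_at - descheduled_at
    return [expiry + delta for expiry in expiries]
```

For instance, a timer that was due 200 time units after the domain was stopped is still due 200 units after it resumes.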


3.3 Network Simulation


SliceTime relies on ns-3 as network simulator, as opposed 
to our preliminary work [39, 40] in which OMNeT++ was
used. In contrast to OMNeT++ and the vast majority of
event-based network simulators, ns-3 internally represents
packets as bit vectors in network byte order, resembling
real-world packet formats. This removes the need
for explicit message translation mechanisms and simplifies
the implementation of network emulation features.
The modular design of ns-3 facilitates the integration of
the additional components as needed by SliceTime. The 
timing control as well as the communication interface are 
implemented as completely separate components whose 
implementation is not intermingled with existing code. 

There are some similarities between the Slice- 
Time simulation components and the emulation features 
already provided by ns-3. Both have to synchronize the 
event execution to an external time source. For the exist- 
ing emulation implementation of ns-3 this is the wall- 
clock time. In the case of SliceTime the synchronizer 
acts as external time source. The so-called simulator
implementations in ns-3 are responsible for scheduling,
unqueuing and executing events. There is one which
does this in a standard manner and another one for real-time
simulations (i.e., synchronized to wall-clock time).
Which of these is used is determined by setting a global
variable in the simulation setup.

We added a third simulator implementation that con- 
nects arbitrary ns-3 simulations to the timing control in- 
terface. The simulation registers at the synchronizer be- 
fore its actual run begins. Similarly, the simulation dereg- 
isters itself at the synchronizer after all events have been 
executed. Upon the execution of an event, our implemen- 
tation checks whether its associated simulation time is in 
the current time slice. If this is not the case, it sends a finish
message to the synchronizer and waits for the barrier
to be shifted. The actual communication with the synchronizer
is encapsulated in a helper class which holds
a socket, provides methods to establish and tear down 
a connection and to exchange the synchronization mes- 
sages. Another modification is the provision of a method 
which schedules an event in the current time slice. This 
is needed because the regular scheduling methods only 
provide the time of the last executed event, which can be 
wrong in case of network packets arriving from outside 
the simulation. 
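
The slice check performed before each event can be sketched as below. This is an illustrative Python model, not ns-3 code: `run_slices` and the `wait_for_barrier` callback are hypothetical names, and the callback stands in for sending the finish message and blocking until the synchronizer shifts the barrier.

```python
import heapq


def run_slices(events, slice_len, wait_for_barrier):
    """Execute (time, name) events without running past the current slice."""
    heap = list(events)
    heapq.heapify(heap)
    slice_end = slice_len            # the first slice covers [0, slice_len)
    executed = []
    while heap:
        when, _ = heap[0]
        if when >= slice_end:        # next event lies beyond the barrier
            wait_for_barrier(slice_end)
            slice_end += slice_len
            continue
        executed.append(heapq.heappop(heap))
    return executed
```

Times are kept as integers (for example microseconds) in this sketch so that slice boundaries compare exactly.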

The ns-3 simulator already provides two mechanisms 
for data communication with external systems. Both can 
be used with real-time simulations and synchronized em- 
ulation. The emulation net device works like any ns-3 
network device, but instead of being attached to a simu- 
lated channel, it is attached to a real network device of 
the system running the simulation. In contrast, the
tap bridge attaches to an existing ns-3 network device 
and creates a virtual tap device in the host system. With 
both mechanisms, packets received on the host system 
are scheduled in the simulation and packets received in 
the simulation are injected into the host system. 

Besides supporting these two existing mechanisms, we added
a synchronized tunnel bridge. It implements the data
communication interface and connects the simulation to
a remote endpoint. The endpoint is usually formed by a
VM; however, the tunnel protocol could also be used to
interconnect different instances of ns-3. Again the actual
communication is encapsulated in a helper class. This is 
not only to keep the bridge itself small, but also to re- 
duce the number of sockets needed. In a scenario where 
multiple tunnel bridges are installed inside a simulation 
it is sufficient to have one instance of this helper class. 
Outgoing packets are sent through its socket to a destina- 
tion specified by the bridge sending the packet. Incoming 
packets are dispatched by an identifier included in our
tunnel protocol and then scheduled as an event in the
bridge to which the sender of the packet is
connected. Since incoming packets are not triggered by
an event inside the simulation but can occur at any time,
a separate thread performs a blocking
receive call on the socket. This technique avoids
polling and is also used by the emulation
net device and the tap bridge.
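
The shared helper with its blocking receive thread might look roughly like the following sketch. The class `TunnelHelper` and the two-byte bridge identifier at the start of each message are assumptions for illustration; the actual tunnel protocol header is not specified here.

```python
import socket
import struct
import threading


class TunnelHelper:
    """One socket shared by all bridges; a thread blocks on recv and dispatches."""

    def __init__(self, sock):
        self.sock = sock
        self.bridges = {}   # bridge id -> callback taking the payload
        self.thread = threading.Thread(target=self._receive_loop, daemon=True)

    def register(self, bridge_id, callback):
        self.bridges[bridge_id] = callback

    def start(self):
        self.thread.start()

    def send(self, bridge_id, payload):
        # Hypothetical framing: 16-bit bridge id in network byte order.
        self.sock.send(struct.pack("!H", bridge_id) + payload)

    def _receive_loop(self):
        while True:
            data = self.sock.recv(65535)   # blocking receive, no polling
            if not data:
                return                     # peer closed the connection
            if len(data) < 2:
                continue                   # too short to carry an identifier
            (bridge_id,) = struct.unpack("!H", data[:2])
            handler = self.bridges.get(bridge_id)
            if handler is not None:
                handler(data[2:])          # would schedule an event in the bridge
```

Because every bridge registers with the same helper, a simulation with many bridges still needs only one socket and one receiver thread.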


4 System Evaluation 


We now examine the achievable accuracy of SliceTime. 
First, we look into the timing precision and the accuracy 
of the perceived throughput. Later on, we also measure 
the performance impact introduced by the synchroniza- 
tion process on the general run-time performance. We 
further investigate how it affects the perceived CPU per- 
formance on a VM. All experiments were carried out in 
a testbed of four Dell Optiplex 960 PCs, each equipped 
with a 3 GHz Intel Core2 Quad CPU and 8 GB of RAM,
either executing our VM implementation based on Xen 
or ns-3 with our synchronization extensions. The PCs 
were interconnected using Gigabit Ethernet. Regarding 
the VMs, we used Linux 2.6.18-xen for the control do- 
main as well as the guest domains. 

Most importantly, SliceTime needs to produce valid re- 
sults for any run-time behavior of both the simulation and 
the VMs attached. For this purpose, we investigate two
performance metrics at different levels of synchronization
accuracy: the round-trip time between a simulation
and a VM, and the TCP throughput of two VMs
communicating over a simulated
network.


4.1 Timing Accuracy 


In our first experiment, we captured 1500 ICMP Echo
replies (Pings) between a VM and a simulated host for 
different simulated link delays and time slice sizes. Fig- 
ure 4 shows the measured RTT distributions for a fixed 
time slice size of 0.1 ms. We visualize the RTT distribu- 
tions using standard box plots. The boxes are bounded 
by the upper and lower quartile of the corresponding 
RTT distribution. The box represents the middle 50% of
the RTT measurements and its width is given by the
interquartile range (IQR). The whiskers visualize the lowest
and the highest RTT measured within an interval of
1.5 IQR.

Figure 5: RTT distributions for different time slice sizes:
smaller time slices lead to a higher synchronization
accuracy and less variance in the measured RTTs.
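
For readers who want to reproduce the box plot statistics used in this section, the following sketch computes them from a list of RTT samples. The function name `box_stats` is hypothetical, and the choice of the "inclusive" quartile method is an assumption, since the quartile convention behind the plots is not stated.

```python
import statistics


def box_stats(samples):
    """Quartiles, IQR and whisker ends (extreme samples within 1.5 IQR)."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    low = min(x for x in samples if x >= q1 - 1.5 * iqr)
    high = max(x for x in samples if x <= q3 + 1.5 * iqr)
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "whisker_low": low, "whisker_high": high}
```

Samples beyond the whisker ends are the outliers drawn as individual points.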

If no simulation delay is present, most RTTs fall into
a small range around 0.2 ms. We term this the base delay;
it comprises the time for processing and packet propagation.
At all other simulation delays, the median and
the RTT distributions are correctly shifted by
twice the simulated link delay. For every series, a few outliers
are well above the expected range. We attribute these
deviations to the non-deterministic processing delay of
ICMP frames inside the VM’s protocol stack. Figure 5
displays the relation between the chosen time slice size and the
resulting RTT distributions for a fixed simulated link delay
of 0.5 ms.
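
As a worked illustration of this shift, the expected RTT can be written as below. The symbols $d_{\text{link}}$ and $d_{\text{base}}$ are introduced here for illustration only, and the base delay is the approximate value of 0.2 ms reported above.

```latex
\mathrm{RTT}_{\text{expected}} \approx 2\, d_{\text{link}} + d_{\text{base}},
\qquad
d_{\text{link}} = 0.5\,\text{ms},\quad d_{\text{base}} \approx 0.2\,\text{ms}
\;\Longrightarrow\;
\mathrm{RTT}_{\text{expected}} \approx 1.2\,\text{ms}
```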

As expected, the variation decreases for smaller time 
slices and converges towards the expected value of twice 
the simulated link delay plus the base delay. First, this 
result clearly demonstrates that a higher synchronization 
accuracy directly impacts the accuracy of the measure- 
ments themselves. Second, we see that it is important 
to choose the time slice size considerably smaller than 
the simulated link delay. Hence, the slice size is a
crucial parameter of SliceTime.
For the simulation of many WAN scenarios (e.g., Internet
services) time slices in the range between 0.1 ms and
2 ms are sufficient, as RTTs are mostly in the range of
several milliseconds.

Figure 6: Network Throughput at different time slice
sizes: the synchronization does not affect the throughput
perceived by the VMs. The measured throughput on the
VMs corresponds to the simulated link capacity.


4.2 Throughput Accuracy 


We now evaluate the accuracy of our implementation re- 
garding the network throughput perceived by the VMs. 
For this purpose we use a small ns-3 simulation, con- 
sisting of one IP node to which two gateway nodes are 
attached using full-duplex CSMA/CD channels. To each 
of those two gateway nodes, one VM is connected. Using 
the netperf [19] TCP_STREAM benchmark, we measured 
the throughput between both VMs. Figure 6 shows the 
results for different simulated channel bandwidths and 
varied time slice sizes. The data points are averages over 
10 netperf runs, with every run lasting 20 seconds. 

Most notably, the synchronization is transparent to the 
VMs in terms of perceived TCP bandwidth, as the time 
slice size has practically no influence on the measured 
TCP throughput. In addition, the throughput measured 
on the VMs closely reflects the simulated channel
bandwidth. On average, the measured net throughput on 
the VMs is 5.4% lower than the simulated link capacity. 


4.3 Synchronization Overhead


Because synchronized VMs are not operating in real- 
time, we now analyze the overhead in terms of actual 
run-time penalties introduced by the synchronization. We 
measured the real-time duration for 120 seconds of logi- 
cal time issued to the VMs by the synchronizer. All VMs 
were executed on the same physical machine. We calcu- 
lated the overhead ratio (OR) by dividing the consumed 
real-time by the logical run-time. Figure 7(a) displays the 
OR of one and two VMs (HVM mode) for varying time 
slice sizes. Down to a slice size of 0.5 ms, the synchronization
overhead remains below 10%, which is still close to real-time
behavior. For smaller slice sizes, VMs need to be
suspended and unpaused more frequently, and the messaging
overhead increases. This leads to a higher OR.

Figure 7: Overhead introduced at the VM at different
synchronization levels: we observe less than 10% of run-time
overhead for time slices greater than 0.5 ms. The
overhead is linear in the number of VMs on one physical
machine.
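
The overhead ratio defined above is a simple quotient, sketched here for clarity. The function name and the sample values are hypothetical and are not measurements from our experiments.

```python
def overhead_ratio(real_seconds, logical_seconds):
    """OR = consumed real time / logical run time; 1.0 means real-time speed."""
    if logical_seconds <= 0:
        raise ValueError("logical run time must be positive")
    return real_seconds / logical_seconds


# Hypothetical example: 120 s of logical time that took 129 s of real time.
ratio = overhead_ratio(129.0, 120.0)
overhead_percent = (ratio - 1.0) * 100.0   # roughly 7.5 percent overhead
```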

The parallel execution of several VMs per physical
machine is not the main objective of our work. Nevertheless,
our implementation facilitates such
configurations. Figure 7(b) shows the OR also for a
higher number of VMs. The increase of the OR is linear
in the number of VMs for all time slice sizes. This is
a direct consequence of our scheduling policy. Even if
a system is equipped with multiple processors or cores, 
VMs are always executed in a pure sequential order. This 
is a limitation of our current implementation and we 
regard the parallel execution of multiple synchronized 
VMs as future work. 


4.4 CPU Performance Transparency 


One of the major reasons for the run-time efficiency of
SliceTime is that the VMs, once scheduled,
are executed natively on the host machine instead
of on a full simulation of the system hardware. While we have
previously shown that the integration with the network
simulation is accurate in terms of timing and network
bandwidth, we now investigate the transparency of our
VM implementation regarding the perceived CPU performance
within a VM. In an ideal case, the perceived
CPU performance of a VM would be invariant at different
levels of synchronization accuracy.

Figure 8: CoreMark CPU Benchmark score at different
time slice sizes: For smaller time slices, the CPU performance
of a VM decreases due to an increased number of
L2 cache misses. Please note the inverted y-axis on the
right.


In order to quantify the CPU performance of a VM, we 
executed CoreMark [34] inside the synchronized VM. 
CoreMark is a synthetic benchmark for CPUs and micro- 
processors recently made available by the Embedded Mi- 
croprocessor Benchmark Consortium (EEMBC). It per- 
forms different standard operations, such as CRC cal- 
culations and matrix manipulations, and outputs a sin- 
gle CPU performance score. Figure 8 shows the Core- 
Mark score for different time slice sizes. Most notably, 
the CPU performance is rather stable for time slices
above 0.2 ms. For a time slice size of 0.1 ms, the impact of
the synchronization is still less than 5%. However, for
smaller values, the CPU performance decreases rapidly. At
the highest measured accuracy level (0.01 ms), the Core- 
Mark score drops to about 73% of the score of an unsyn- 
chronized VM on the same hardware. 


We further investigated this effect using OProfile [28] 
and its XenoProf [25] extension. By concurrently execut- 
ing OProfile in the control domain while CoreMark was 
running inside the VM, we were able to trace internal 
CPU events caused by the VM. This way, we identified
an increased number of L2 cache misses as the cause of the
observed performance degradation. As shown in Figure 8,
the number of L2 cache misses is negatively correlated with
the measured CoreMark scores. For smaller time slices,
the CPU needs to be switched more frequently between 
the execution of the VM and the control domain, thus 
decreasing the efficiency of L2 caching. Although this is 
a conceptual issue, we argue that the effect is negligible
for time slices down to 0.1 ms. This means that for the
vast majority of application scenarios, which will use larger
slices, this minimal performance reduction will have no
negative influence on the produced results.

Figure 9: Simple P2P Network: the simulation consisted
of one VM and 15000 simulated nodes (60 backbone
nodes with 250 host nodes each)

Figure 10: Throughput between VM and simulated hosts
at different hopcounts


5 Applications 


We now describe two typical use cases for SliceTime. 


5.1 Simple P2P Network 


A core motivation of our work is to enable large-scale
network emulation setups on customary hardware. In order
to stress our framework in this direction we first applied
it to a large-scale WAN scenario in
which 15000 simulated nodes exchange data in a P2P- 
like fashion. Due to the simulation size and event load, 
the whole setup executes about 15 times slower than real- 
time. For this experiment we used just two of the four 
testbed machines (cf. Section 4). One machine executed 
the VM infrastructure and the synchronizer while the 
simulation was running on the other one. Figure 9 illus- 
trates the two-tier topology we used, consisting of 60 in- 
terlinked backbone nodes, to which 250 host nodes each 
are attached via an access router. All host nodes act both 
as HTTP servers and HTTP clients, requesting a random 
number of 64kb data blocks from each other. To one of 
the access routers we connect one VM that runs a standard
Linux distribution. The synchronization accuracy
was set to 0.1 ms. Using the standard curl command-line
tool we measured the HTTP throughput between the
virtual machine and simulated hosts at different hop
distances (see Figure 10). The observation of the throughput
decreasing for higher hop counts is expected and rather 
straightforward. However, our point here is a different 
one. First, we achieve valid and consistent measurements 
on the VM despite both the simulation and the VM op- 
erating only at a fraction of wall-clock time. Second, 
this simple example shows that SliceTime enables one 
to evaluate real-world networking software in a large- 
scale simulated context at low hardware and minor setup 
costs, especially if compared with equally sized physical 
testbeds or simulation hardware capable of executing the 
same simulation in real-time. 


5.2 WiFi Software


SliceTime enables investigations of WiFi software for 
Linux in a fully isolated, deterministic and reproducible 
context. The 802.11 software is deployed on a set of 
VMs, while the network simulation models the wireless 
channel, the medium access control as well as potential 
node movement. In addition, the network simulation can 
optionally be used to also model other parts of the net- 
work, such as 802.11 access points, other mobile hosts 
or an arbitrary wide-area network connecting the 802.11 
infrastructure. In the following, we briefly describe the 
802.11 extensions of SliceTime before we use our frame- 
work to remodel a real-world field test of an AODV rout- 
ing daemon for Linux. 


5.2.1 SliceTime 802.11 extensions 


To enable WiFi support in SliceTime we designed a sec- 
ond data communication interface (cf. Section 3.2.1). 
Figure 11 illustrates its core components and layers. 
On the VM a loadable kernel module forms the Slice- 
Time device driver that provides a virtual WiFi interface. 
The device driver implements the 802.11 wireless ex- 
tensions for Linux network devices. This makes the vir- 
tual WiFi interface look like a real wireless networking 
card. For example, commands such as iwconfig may 
be used to put the virtual WiFi device into monitor mode. 
The actual WiFi software may directly access this inter- 
face or rely on the Linux TCP/IP stack for its commu- 
nication purposes. So-called WiFi gateway nodes repre- 
sent the VMs inside the simulation. The WiFi gateway 
nodes perform all 802.11 MAC layer operations, for in- 
stance sending ACKs, that are normally carried out by 
WiFi hardware. A major benefit of this approach is that
all communication events that are sensitive to strict timing
constraints remain in the simulation domain. Typically
a relatively loose VM-simulation synchronization accuracy
of 0.5 ms, and hence low overhead, is sufficient for
most SliceTime WiFi set-ups. By contrast, implementing
the MAC behavior in the driver would require a synchronization
accuracy finer than the 802.11 inter-frame
spaces (IFS). Besides the IFS being smaller than the maximum
synchronization accuracy of SliceTime, the high
messaging overhead for such tight intervals would also
render such a design impractical.

Figure 11: SliceTime provides a virtual network device to
the VMs that integrates with ns-3 at the MAC layer. This
facilitates testing arbitrary WiFi and networking software
with, for example, reproducible channel conditions and
node movement.

Figure 12: Real-World AODV experiment vs. remodeled
SliceTime scenario: the hopcount distribution of received
packets obtained from the scenario remodeled with SliceTime
well matches the hopcounts measured in the real-world
scenario.


Besides implementing the data exchange between the 
VM device driver and the ns-3 simulation model, the 
WifiEmuBridge also maps configuration actions such as
those triggered by iwconfig to corresponding operations in
ns-3. In addition it is able to export packet-level statistics 
such as RSSI values to the software running on the VM 
using Radiotap packet headers. A more elaborate discus- 
sion of our ns-3 WiFi emulation extensions can be found 
in [38]. 


5.2.2 AODV routing daemon study 


We used SliceTime to remodel the AODV part of a real- 
world field test [13] in which different mobile ad-hoc net- 
work (MANET) routing protocol implementations were 
evaluated. In the original experiment volunteers on an 
athletic field carried around 33 laptops running an AODV 
daemon. The AODV routing daemon used the 802.11b 
ad-hoc demo mode for link layer communication. During 
the experiment the mobile nodes recorded both routing 
and traffic statistics as well as GPS traces to log the node 
mobility. Corresponding trace files are publicly available 
at the CRAWDAD repository [14]. To remodel the orig- 
inal experiment entirely in software using SliceTime we 
set up 33 VMs executing the AODV software bundled 
with the trace files from CRAWDAD. The AODV dae- 
mon was configured to use the virtual WiFi NetDevice 
of SliceTime. We implemented a corresponding simula- 
tion scenario in ns-3, which used the ns-3 log distance 
propagation loss model and random fading for model- 
ing the wireless channel. In addition we extended ns-3 
with a mobility model that reproduces the nodes’ mo- 
bility according to the GPS traces. We only used one of 
our testbed machines for this experiment. It hosted all 
33 VMs, the synchronizer and the ns-3 simulation. The 
synchronization accuracy was configured to 0.5 ms. Fig- 
ure 12 compares the AODV hopcount distributions of 
received packets for the real-world data and the corre- 
sponding remodeled scenario. The hopcounts measured 
using SliceTime closely match the observations from the
real-world field test. We also determined the average 
packet delivery ratio (PDR) for the real-world experi- 
ment and the emulated scenario. From the CRAWDAD 
traces we calculated the avg. PDR to be 42.10% for the 
real-world AODV experiment. In our remodeled scenario 
the avg. PDR amounts to 46.39%. 


There will always be differences between real-world 
measurements and observations taken with systems such 
as SliceTime. This is a direct consequence of the dispar- 
ity between the real world and the environment modeled 
in software. The 802.11 model of ns-3, for example, is 
relatively sophisticated and quite accurately reproduces 
the behavior of the 802.11 MAC and PHY layers. How- 
ever, there are many factors that are not considered by 
our remodeled scenario, like antenna characteristics or 
even a hypothetical nearby microwave that could have 
influenced the real-world measurements. 


Nevertheless, this use case shows that SliceTime is
well able to provide a testing environment for 802.11
software that delivers results close to reality. Repeating
real-world experiments like the one conducted
by Gray [13] is costly and often challenging due to con- 
tinually changing conditions, for example, regarding the 
wireless channel. By contrast, SliceTime allows one to arbitrarily
modify and rerun WiFi software experiments at
the push of a button. SliceTime is also cost effective com- 
pared to the hardware costs and manpower requirements 
of the original experiment. While the original experiment 
involved around 40 volunteers and the same number of 
laptops, with SliceTime the same experiment can be con- 
ducted on one desktop PC. 


6 Related Work 


Early contributions [1, 17, 30, 36] in the field of network
emulation focus on opaque network emulation in which 
physical network systems are connected to an emulation 
engine that models the network propagation. The model 
affects the packet flow, either by introducing delay, jitter, 
bandwidth limitations, or packet errors. Later contribu- 
tions extend this methodology for the emulation of In- 
ternet paths [31] or use real-world measurements [5] for 
accurately reproducing the behavior of large-scale net- 
works. Opaque network emulation is an effective method 
to investigate the impact of network propagation charac- 
teristics on protocol performance. However, because all 
communicating peers are physical systems, the analysis 
of large-scale scenarios (e.g., P2P and overlay networks) 
with many hosts is difficult. 

Protocol-aware network emulation was introduced by 
Fall [10], proposing the combination of real network sys- 
tems and discrete event-based simulations. This imple- 
mentation has been improved later in terms of timing ac- 
curacy [24]. Protocol-aware emulation features also ex- 
ist for other event-based network simulators [35]. All of 
these implementations are subject to potential simulation 
overload. Kiddle [21] used massive computing power in
the form of hardware to increase the execution speed of the
simulation to circumvent this problem. While this works
up to a certain point, our aim is in the opposite direction 
of slowing down the real system, saving on hardware ex- 
penses and setup complexity. 

Erazo et al. recently proposed SVEET! [9], a hybrid 
TCP evaluation environment that integrates Xen-based 
VMs with an SSFNET [8]-based emulation engine. Although
SVEET! involves a mechanism to cope with simulation
overload, it differs significantly from our work.
In order to match the execution speed of both the VMs 
and the emulation engine, SVEET! utilizes a static time 
dilation factor (TDF). The TDF is used to throttle down 
the speed of both the simulator and the VMs to the worst- 
case run-time performance of the emulation engine. The 
main drawback here is the need to correctly choose the 
TDF beforehand. If the chosen TDF is too large, the 
run-time is increased without any benefit due to under- 
utilization of system resources. If the chosen TDF is 
too small, simulation overload and time drifts can occur, 
leading to flawed results. In contrast, our approach does 
not statically throttle the execution speed of any component
by a constant factor. Moreover, the conservative
barrier algorithm used in our work limits the drift of all 
components to the duration of one time slice. 
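
To make the contrast with a static TDF concrete, the following toy model lets each component advance to a shared barrier and only then moves the barrier forward. It is a sketch under simplifying assumptions (components run concurrently and each has a constant speed); `Component` and `synchronize` are hypothetical names, not part of our implementation.

```python
class Component:
    """A VM or simulator that advances `speed` logical units per real-time unit."""

    def __init__(self, name, speed):
        self.name, self.speed = name, speed
        self.logical = 0

    def run_until(self, barrier):
        """Advance to the barrier and return the real time this took."""
        real = (barrier - self.logical) / self.speed
        self.logical = barrier
        return real


def synchronize(components, slice_len, slices):
    """Conservative barrier: nobody passes the barrier until all have reached it."""
    barrier, real_elapsed = 0, 0.0
    for _ in range(slices):
        barrier += slice_len
        # Components run concurrently, so the slowest one sets the pace.
        real_elapsed += max(c.run_until(barrier) for c in components)
    return barrier, real_elapsed
```

No slowdown factor has to be chosen in advance: the pace follows the slowest component in every slice, and no component is ever more than one slice ahead of another.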


Different virtualization-based opaque network emu- 
lation approaches have been discussed over the past 
years. ENTRAPID [16] executes multiple instances of 
the FreeBSD network stack in the user space. These vir- 
tual network kernels (VNKs) are wired together and form 
a network emulation environment. As the VNKs are ex- 
ecuted simultaneously and operate in wall-clock time, 
this limits the scalability of this approach. dONE [4] 
proposes the virtualization of time to address this problem.
Despite this similarity, SliceTime differs significantly
from both dONE and ENTRAPID: first, neither
dONE nor ENTRAPID integrates software prototypes
with an event-based network simulation at all. By contrast,
SliceTime relies on ns-3 as emulation backend. This
enables the set-up of emulation scenarios that access all
models and features of the network simulator. Second, in
contrast to SliceTime, neither dONE nor ENTRAPID
allows the investigation of entire network protocol stacks,
as both draw the line between the emulation environment
and software prototypes right at the socket layer.
DieCast [15] and Time Jails [12] facilitate the setup of
network emulation testbeds solely based on virtual machines.
The main advantage compared to the aforementioned
systems is that they allow one to execute unmodified
software and protocol stacks. Both are an attractive
option for real-world experiments in which the number
of nodes exceeds the quantity of physical hosts of
a testbed. In addition, DieCast not only scales time, but
also the performance of system components to accurately 
model a realistic hardware behavior profile. SliceTime, 
by contrast, follows a different goal. Instead of virtual- 
izing time to increase the capacity of a physical testbed, 
we employ it for synchronizing a VM with a network 
simulation that forms the emulation engine. This has two 
advantages. First, using a network simulator as backend 
allows us to put concepts such as virtual node mobility 
into action, which is not possible with either DieCast
or Time Jails. Second, the scalability of the simulator
opens up the possibility of implementing large-scale em- 
ulation scenarios that could not be realized using VMs 
alone without taking up much higher hardware resources. 


Emulab [42] is a well-established large network 
testbed allowing for the evaluation of networked soft- 
ware in different communication environments. Its main 
strength is the ability to specify network scenarios using 
a configuration file which Emulab maps to the testbed 
hardware. In order to reproduce the characteristics of 
networks of many kinds, Emulab also employs opaque 
network emulation between the testbed nodes. In direct 
comparison with SliceTime, Emulab achieves its flexibility
by incorporating a large number of networked computers,
network infrastructure as well as auxiliary components.
We admire the efforts and achievements of its
creators in this regard. SliceTime instead aims at provid- 
ing a flexible and scalable network experimentation and 
evaluation platform with very modest hardware require- 
ments. This is reflected in our evaluation which at most 
required two desktop PCs to carry out the large-scale
WAN experiment. We achieve this goal by scaling execution
time and by modeling large parts of the scenario using
the ns-3 simulator. On the one hand, the use of a simulator
limits the possible degree of realism due to discrepancies
between the real world and the corresponding simulation
models. On the other hand, relying on a simulation allows
the construction of “virtual network testbeds” that are not
dependent on the availability of physical hardware or real
network infrastructure.

Wireless network emulation tools, such as the CMU 
Emulator [20], interconnect antenna connectors of stan- 
dard wireless network hardware via cables. Complex 
hardware, mostly based on FPGAs and DSPs, is used to 
model the wireless channel. While this enables a quite re- 
alistic emulation, it requires complete physical hardware 
for each station. There are also a number of pure
software-based wireless network emulation tools. Most of
them, such as [26, 29, 43], only mimic the propagation of
packets on the wireless link and do not support simulated 
wireless stations. A few wireless network emulation sys- 
tems [22,32,33] are based on event-based network simu- 
lators. They share some similarities with the WiFi exten- 
sions of SliceTime, but differ significantly in the way they 
interface the software prototypes with the 802.11 simula- 
tion. In [22,32] the 802.11 simulation model is integrated 
with the software at the IP layer, which prevents investi- 
gations of 802.11 software using a different routing pro- 
tocol than IP. VirtualMesh [33] bridges the gap between 
the simulation and the WiFi software at the MAC layer, 
but requires the modification of all applications making 
use of the wireless extensions. By contrast, the 802.11 
add-ons of SliceTime introduce a clean cut between the 
simulation and the prototypes at the MAC layer. This en- 
ables arbitrary WiFi software for Linux to be evaluated 
without any changes to the software. 


7 Conclusion 


In this paper we presented SliceTime, a platform for scal- 
able and accurate network emulation. SliceTime enables 
the detailed analysis of protocol implementations and en- 
tire instances of operating systems inside simulated net- 
works of arbitrary size. We achieve this goal by matching 
the execution speed of software prototypes encapsulated 
in virtual machines to the run-time performance of the 
event-based simulation. Our evaluation has shown that 


SliceTime is accurate as it integrates network simulations 
of any size with VM-based prototypes regarding timing 
and network bandwidth in a transparent way. 

SliceTime is resource efficient. We model large parts 
of the experiment with a simulation and match its overall 
execution speed to the available hardware resources. This 
makes it possible to conduct large-scale network emu- 
lation studies with very moderate hardware costs, espe- 
cially if compared to equally sized physical testbeds. 

SliceTime opens up new application areas for network 
emulation. In the past, only event-based simulations ex- 
ecuting in real-time could form a basis for network em- 
ulation. This is not true for the vast majority of network 
simulations. For example, the computational complexity of 
802.11 channel models so far hindered the use of net- 
work emulation for larger WiFi scenarios. By eliminat- 
ing this burden of real-time execution, SliceTime allows 
any simulation to be used for network emulation. We 
have demonstrated that this extends the applicability of 
network emulation to large-scale WAN and 802.11 sce- 
narios. As we believe that SliceTime will be useful for 
a number of researchers and developers, we have made 
the source code available to the public. It can be
downloaded at
http://www.comsys.rwth-aachen.de/research/projects/slicetime.

Abstract 


CTrack is an energy-efficient system for trajectory map- 
ping using raw position tracks obtained largely from 
cellular base station fingerprints. Trajectory mapping, 
which involves taking a sequence of raw position sam- 
ples and producing the most likely path followed by 
the user, is an important component in many location- 
based services including crowd-sourced traffic monitor- 
ing, navigation and routing, and personalized trip man- 
agement. Using only cellular (GSM) fingerprints instead 
of power-hungry GPS and WiFi radios, the marginal en- 
ergy consumed for trajectory mapping is zero. This ap- 
proach is non-trivial because we need to process streams 
of highly inaccurate GSM localization samples (aver- 
age error of over 175 meters) and produce an accurate 
trajectory. CTrack meets this challenge using a novel 
two-pass Hidden Markov Model that sequences cellu- 
lar GSM fingerprints directly without converting them to 
geographic coordinates, and fuses data from low-energy 
sensors available on most commodity smart-phones, in- 
cluding accelerometers (to detect movement) and mag- 
netic compasses (to detect turns). We have implemented 
CTrack on the Android platform, and evaluated it on 126 
hours (1,074 miles) of real driving traces in an urban en- 
vironment. We find that CTrack can retrieve over 75% 
of a user’s drive accurately in the median. An impor- 
tant by-product of CTrack is that even devices with no 
GPS or WiFi (constituting a significant fraction of to- 
day’s phones) can contribute and benefit from accurate 
position data. 


1 INTRODUCTION 


With the proliferation of sensor-equipped smartphones, 
the decades-long promise of location-based mobile ser- 
vices and mobile sensing applications is finally becom- 
ing real. Many location-based applications periodically 
probe the device’s position sensor to obtain a stream of 
position samples, and then process this stream to ob- 
tain a trajectory. Examples include crowd-sourced traf- 
fic and navigation applications [15, 33], personalized 
trip management applications [28, 15], fleet manage- 
ment applications [21], and mobile object/asset track- 
ing [11, 34, 7, 19, 25]. The fundamental problem in these 
applications is trajectory mapping, where the goal is to
produce the most likely trajectory (a sequence of map
segments) traversed by the mobile device.


If each device could always use a GPS sensor, this 
problem would be straightforward because the majority of the 
position samples would usually be accurate to within a 
small number of meters. For applications that require po- 
sitions to be monitored continuously, however, GPS has 
some significant practical limitations. First, GPS chipsets 
on today’s mobile devices consume a non-trivial amount 
of energy, causing a significant reduction in battery life 
(§2). Second, in many embedded tracking applications, 
objects are packaged deep inside vehicles and do not 
have a clear line-of-sight to GPS satellites, e.g., anti-theft 
systems on vehicles (often hidden under layers of metal), 
systems that track couriered packages [11] and systems 
like TrashTrack [34] for tracking waste and recycled ma- 
terials. Most of these tracking applications also face en- 
ergy and cost constraints. Third, antenna limitations on 
commodity mobile devices cause poor GPS performance 
in “urban canyons” and near high-rise buildings. Finally, 
a large number of phones today simply do not have GPS 
on them—85% of phones shipped in 2009, and projected 
to be over 50% for the next five years [6]. The users of 
these devices, a disproportionate number of whom are in 
developing regions, are largely being left out of the many 
new location-based applications. 


This paper describes the design, implementation, and 
experimental evaluation of CTrack, a system for map- 
ping the trajectory of mobile devices without using GPS. 
The noteworthy aspect of CTrack is that it uses much 
less energy than current approaches, which use GPS, 
WiFi localization [32, 8], or a combination of the two. 
CTrack processes a stream of raw, highly inaccurate po- 
sition samples from mobile devices obtained by finger- 
printing cellular GSM base stations, and matches them 
to segments on a known map in a way that achieves high 
accuracy. The marginal energy cost of gathering a fin- 
gerprint (a list of nearby GSM towers and their signal 
strengths) is zero on mobile phones because the cellular
radio is almost always on. CTrack optionally augments
GSM fingerprints with data from one or more of a 
phone’s accelerometer, compass, and gyro, all of which 
consume tiny amounts of energy, using these sensor hints
to identify the kind of movement and improve the accuracy
of trajectory mapping.

Figure 1: GSM Localization Errors. Raw location samples
are in red and the true driving path is in black.

GSM localization using, for example, the Placelab [8] 
approach, leads to errors of 100-200 meters in dense urban
areas, and as much as 1 km in some areas. Such errors
are too large for many applications, which require
results with sufficient accuracy to pinpoint a specific road
segment or route driven by a user. Figure 1 illustrates the 
problem with existing GSM localization. The red points 
are raw locations obtained from our implementation of 
cellular positioning as used in Placelab [8]. The actual 
roads traversed (ground truth) are shown in black. Di- 
rectly reporting raw positions or matching locations to 
the nearest segments in the road map would result in un- 
acceptably low accuracy for the applications mentioned 
at the beginning of this section. 


CTrack makes it possible to use GSM fingerprints for
accurate trajectory mapping using two novel ideas. First,
like previous approaches, e.g., VTrack [32], CTrack matches
a sequence of GSM tower observations, rather than a single
point at a time, using constraints on the transitions a
moving vehicle can make between locations. However,
unlike VTrack, which first converts radio fingerprints to
(lat,lon) coordinates, CTrack matches cellular fingerprints
directly to a map, an insight critical to achieving high
accuracy. To do so, CTrack uses a two-pass algorithm.
The first pass is a Hidden Markov Model (HMM) 
that divides space into grid cells, and determines the most 
likely sequence of traversed grid cells. The second pass 
uses a different HMM to match the traversed grid cell 
sequence to road segments. 


The second idea in CTrack is to (optionally) fuse in- 
formation from two low-energy phone sensors: the ac- 
celerometer and a compass or gyroscope. CTrack uses 
the compass/gyro to detect if the driving path took a turn, 
and the accelerometer to determine if the user is stopped 
or moving. These sensor hints can correct some common 
systematic errors that arise in GSM localization. 


We implemented CTrack on the Android smartphone 
platform, and evaluated it on nearly 125 hours of real 
drives (1,074 total miles) from 20 Android phones in the 
Boston area. We find that: 

1. CTrack is good at identifying the sequence of road 
segments driven by a user, achieving 75% precision and 
80% recall accuracy. This is significantly better than 
state-of-the-art cellular fingerprinting approaches [8] ap- 
plied to the same data, reducing the error of trajectory 
matches by a factor of 2.5. 

2. Although CTrack identifies the exact segment of 
travel incorrectly 25% of the time, trajectories produced 
by CTrack are on average only 45 meters away from 
the true trajectory. This implies that our system is useful 
for applications like route visualization. In this respect, 
CTrack is 3.5x better than map-matching raw cellular 
fingerprints, which results in 156 meters median error. 

3. CTrack has a significantly better energy-accuracy 
trade-off than sub-sampling GPS data to save energy,
reducing energy cost by a factor of 2.5 for the same level 
of accuracy. 


2 WHY CELLULAR? 


One of the key motivations for CTrack is that it uses sub- 
stantially less energy than GPS. This is to be expected 
from a theoretical standpoint because of the difference 
in effective radiated power (ERP) for the two systems. 
GPS satellites fly in an orbit 11,000 miles above the 
earth, with a transmission power of 50 W, resulting in 
$2 \times 10^{-11}$ mW/m$^2$ at the receiver; in contrast,
typical cellular systems register an ERP of up to
10 mW/m$^2$ [14]. 
This difference of 117 dB translates directly into energy 
consumption at the receiver, as the difference must be 
compensated by additional processing gain and amplifi- 
cation. The ERP difference also explains why GPS sig- 
nals cannot be acquired without relatively unobstructed 
line-of-sight to orbiting satellites, and why they are more 
sensitive to weather conditions than GSM signals. 
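
As a quick check of the 117 dB figure, the ratio of the two power densities can be converted to decibels. The sketch below uses only the two values quoted in the text; the variable names are illustrative.

```python
import math

# Power densities quoted in the text, both in mW/m^2.
gps_density = 2e-11       # GPS signal at the receiver
cellular_density = 10.0   # upper bound for typical cellular systems

# A power ratio expressed in decibels is 10 * log10(P1 / P2).
difference_db = 10 * math.log10(cellular_density / gps_density)
print(f"{difference_db:.0f} dB")
```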


2.1 Energy Measurements 


We performed a simple experiment to quantify the en- 
ergy consumption of each of the sensors of interest — 
GPS, WiFi, GSM, the compass and the accelerometer on 
an Android G1 phone. For each sensor, we wrote an An- 
droid application to continuously sample the sensor at 
some given frequency, as well as continuously query the 
battery level indicator. We charged the phone to 100%, 
configured the screen to turn off automatically when idle 
(the default), and started the application. We used the An- 
droid telephony API to retrieve nearby cell towers and 
their associated signal strength values. 

Figure 2 shows the reported battery life as a function 
of time for four configurations: GPS sampled every sec- 
ond, GPS sub-sampled every two minutes, WiFi scanned 
every second, and the configuration used by CTrack — 
scanning GSM cell towers every second, and the compass
and accelerometer at 20 Hz.

Figure 2: Energy Consumption: GPS vs WiFi vs CTrack
on an Android Phone.

CTrack results in a saving of approximately 10x in battery
life compared to GPS every second and over 6x compared to
WiFi every second. Also, although sub-sampling GPS every
2 minutes saves energy over continuously sampling it, we show 
later that sub-sampling also hurts accuracy. The battery 
drain curves look irregular because the G1 phone esti- 
mates remaining battery life poorly — the same experi- 
ment on a Nexus One (a later model) showed a similar 
trend, but looked like a straight line for all sensors. 


2.2 Other Energy Studies and Discussion 


The numbers above are consistent with many previous studies
conducted on a range of phones. For example, we 
found [32, 31] that continuously sampling GPS on 
iPhone 3G and 4 resulted in 3-10 hours total battery life 
(iPhone 3G has lower battery life, and screen brightness 
varied in the different papers, resulting in different run 
times even without GPS). Leaving the phone on (with 
screen on) resulted in 10-18 hours of lifetime (this would 
be higher if we could turn the phone’s screen off, but at 
the time, non-jailbroken iPhones did not support back- 
ground applications.) 

In [23], the authors showed that Nokia N95 phones 
use about 370 mW of power when GPS is left on, versus 
60 mW when idling, and that continuous (once a second) 
GPS sampling results in 9 hours of total battery life. Sev- 
eral other papers [36, 16, 5, 9, 13] suggest similar num- 
bers for N95 phones (battery life in the 7-11 hour range) 
with regular GPS sampling. On a more recent AT&T Tilt 
phone [18], the authors found that continuous GPS sam- 
pling used 400 mW, a single GPS fix costs 1.4-5.7 J of 
energy (depending on whether previously seen satellite
information is cached or not) and a WiFi scan consumed 
about 0.55 J of energy. 

The energy cost of GPS is rooted in the need for pro- 
cessing gain to acquire the positioning signals. As signal 
quality degrades due to obstructions or weather condi- 
tions, the energy cost of recovering the signal increases. 
In contrast, because phones continuously track cell tow- 


ers as a part of normal operation, the marginal energy 
cost of CTrack is driven by CPU load. Processing a cell 
tower signature might require at most 100,000 instruc- 
tions, which costs 5 nJ on a current generation 1 GHz 
Qualcomm Snapdragon processor. 

In embedded (non-phone) applications that don’t need 
the radio on, it is possible to track only the signal qual- 
ity and cell ID portions of the GSM protocol. This re- 
quires observing only the BCH slots of the GSM beacon 
channel, which are 4.6 ms long and are transmitted once 
per 1.8-second cycle. A 10% GSM receiver duty cycle
should be adequate to track the strongest towers.
Assuming a GSM receiver draws 17 mA at 100% duty cycle
[1, 30], this represents an additional amortized power
consumption of about 5 mW (1.7 mA @ 2.7 V).

Accelerometers and compasses (magnetometers) also
have low overhead. For example, ADXL 330 accelerometers
use about 0.6 mW when continuously sampling, and at 10 Hz
can be idle about 90% of the time, suggesting a power
overhead of around 0.06 mW for sampling the
accelerometer [2]. The MicroMag3 compass uses about
1.5 mW in continuous sampling, suggesting a power
consumption of 0.15 mW or less at 10 Hz [24]. 
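
The duty-cycle arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; the helper name `duty_cycled_mw` is hypothetical, and a uniform 10% duty cycle is assumed for all three components.

```python
# Figures quoted in the text.
SUPPLY_VOLTS = 2.7
GSM_RX_MILLIAMPS = 17.0
DUTY_CYCLE = 0.10  # assumed to apply to the GSM receiver and both sensors

def duty_cycled_mw(full_power_mw: float, duty: float) -> float:
    """Average power of a component that is active only `duty` of the time."""
    return full_power_mw * duty

gsm_mw = duty_cycled_mw(GSM_RX_MILLIAMPS * SUPPLY_VOLTS, DUTY_CYCLE)
accel_mw = duty_cycled_mw(0.6, DUTY_CYCLE)     # accelerometer at 10 Hz
compass_mw = duty_cycled_mw(1.5, DUTY_CYCLE)   # compass at 10 Hz
total_mw = gsm_mw + accel_mw + compass_mw

print(f"GSM {gsm_mw:.2f} mW, sensors {accel_mw + compass_mw:.2f} mW, "
      f"total {total_mw:.2f} mW")
```

Under these assumptions the GSM receiver accounts for nearly all of the budget, and the total stays just under 5 mW.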

In summary, the power consumption of cellular scan- 
ning plus sensors on phones is less than 5 mW, and the 
power consumption of sensors alone, if cellular is free
(as is typical), is less than 1 mW, low enough that it does 
not reduce the phone’s overall lifetime even when in 
standby mode, when it consumes 20-30 mW of power. 
In contrast, the best case for GPS is 75 mW in tracking 
mode when a fix is already acquired, but in practice is 
closer to 400 mW when including the energy to periodi- 
cally re-acquire fixes, and is similar for WiFi scans every 
second or two. The power differential is thus significant. 


2.3 Embedded Low-Power Applications 


CTrack can also be applied outside the smartphone con- 
text to embedded low-power tagging applications. For 
these applications, minimizing cost and battery require- 
ments is essential. These applications benefit from using 
GSM in place of GPS because of increased flexibility of 
antenna placement for cellular systems, and resilience to 
obstructed environments. 

One such application is cold-chain management where 
the focus is on monitoring the temperature of a pack- 
age during its shipping. A low-power passive cellular re- 
ceiver can be used to record cellular fingerprints during 
transport. Upon arrival, CTrack can be run on the fin- 
gerprints to compute the shipment’s trajectory and map 
temperature readings on to it. 

Another embedded application of CTrack is Trash- 
Track [34, 7], where items of trash were tagged with 
active tags that traced the items through the path along
the “disposal chain”. Because the tag will eventually be
destroyed, this system needs cellular communication
capabilities; using the same technology for trajectory
mapping consumes lower power, has lower cost, and is more
robust than adding a GPS receiver to the tag.

Figure 3: CTrack System Architecture.


3 SYSTEM OVERVIEW 


We now describe the design of CTrack. Figure 3 shows 
the system architecture. It consists of two software com- 
ponents, the CTrack Phone Library, and the CTrack Web 
Service. The library collects, filters, and scans for GSM 
and sensor data on the phones, and transmits it via any 
available wireless network (3G, WiFi, etc.) to the web 
service, which runs the trajectory mapping algorithm 
on batches of sensor data to produce map-matched tra- 
jectories. The mapping algorithm runs on the server to 
avoid storing complete copies of map data on the mobile 
device, and to provide a centralized database to which 
phone or web applications can connect to view and an- 
alyze matched tracks (e.g., for visualizing road traffic or 
the path taken by a package or vehicle). 


Phone Library: The phone library collects a list of GSM 
towers and optionally, if accelerometer, compass, or gyro 
are available on the phone, current sensor hints. These 
sensor hints are binary values indicating if the phone 
is moving and/or turning; Section 5 describes how we 
extract sensor hints. The phone library also filters ac- 
celerometer data to detect if the user is stationary or 
walking (as in [27, 31]), for applications that want data 
only from moving vehicles. The library may also be con- 
figured to periodically collect GPS data for use in the 
training phase of our algorithm from users who wish to 
contribute. 

Our implementation collects about 120 bytes/second
of raw ASCII data on average. This quantity varies because
the number of cell towers visible varies with location.
We use simple gzip compression, which on our 
test drives resulted in just 11 bytes/second of data to be 
delivered. We batch this data and upload a batch every t 
seconds. At 11 bytes/sec, with even small batches, using 
a 3G uplink with an upload speed of 30 kBytes/s (typical 
of most current 3G networks in the US) results in very 
low 3G radio duty cycles; for example, setting t to 60 
seconds results in the radio being awake only 0.03% of 
the time, which consumes a negligible amount of addi- 
tional power. Once-per-minute (t = 60) reporting is suf- 
ficient for most applications we are concerned with, in- 
cluding traffic reporting, package tracking, and vehicular 
theft detection. 
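
The duty-cycle estimate can be sanity-checked with a short calculation. This is a rough sketch using the rates quoted above; the function name and parameters are illustrative, and it ignores connection setup and radio tail time, which a real 3G radio would add.

```python
def radio_duty_cycle(data_bytes_per_s: float,
                     batch_seconds: float,
                     uplink_bytes_per_s: float) -> float:
    """Fraction of time spent transmitting when one batch is sent per interval."""
    batch_bytes = data_bytes_per_s * batch_seconds
    transmit_seconds = batch_bytes / uplink_bytes_per_s
    return transmit_seconds / batch_seconds

duty = radio_duty_cycle(data_bytes_per_s=11,
                        batch_seconds=60,
                        uplink_bytes_per_s=30_000)
print(f"{duty:.4%}")
```

With these inputs the radio transmits for a few hundredths of a percent of the time, the same order as the figure quoted in the text. In this simplified model the fraction does not depend on t; batching matters in practice because each radio wake-up carries a fixed overhead.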

We chose not to run trajectory matching on the phone 
because it results in a negligible space savings, while 
consuming extra CPU overhead and energy. For low data 
rates, the primary determinant of 3G or WiFi transmis- 
sion energy is the transmitter duty cycle [4], making 
batch reports a good idea. However, we do extract sen- 
sor hints on the phone because the algorithms for hint 
extraction are simple and add negligible CPU overhead, 
while significantly reducing data rate. The raw data rate 
from sampling the accelerometer/compass without com- 
pression or hint extraction is about 1.3 MBytes/hour, 
which means that an application collecting this data from 
a user’s phone for two hours a day could easily rack up a 
substantial bandwidth bill without on-phone filtering. 


CTrack Web Service: The web service receives GSM 
fingerprints and converts them into map-matched tra- 
jectories using the trajectory mapping algorithm. These 
matched trajectories are written into a database. Option- 
ally, the user’s current segment can be sent directly back 
to the phone. A detailed description of the trajectory 
mapping algorithm is given in the next section. 


4 TRAJECTORY MAPPING ALGORITHM 


CTrack’s algorithm for map-matching a sequence of 
GSM cell tower observations (“cellular fingerprints”)
differs from previous approaches in two key ways. First,
we do not convert cellular fingerprints into (lat,lon)
coordinates before matching them to segments. We find 
that reducing a fingerprint to a single geographic loca- 
tion loses a lot of information because a given cellular 
fingerprint is often seen from multiple locations quite far 
apart. This situation is unlike the WiFi map-matching in 
VTrack [32], where this spread is small, and the approach 
of converting to centroids worked well. Second, CTrack 
optionally fuses sensor hints from the accelerometer and 
the compass to improve matching accuracy. We show 
that turn hints can help remove spurious turns and kinks 
from GSM-mapped trajectories, and movement hints can
help remove loops, a common problem with GSM localization
when a vehicle is stationary.

Figure 4: Trajectory Mapping Algorithm.


4.1 Algorithm Outline 


The goal of the algorithm is to associate a sequence of 
cellular fingerprints to a sequence of road segments on a 
known map. Our algorithm takes as input: 

1. A series of GSM fingerprints from the phone, one 
per second in our implementation. In our paper, the 
term GSM fingerprint refers to a set of observed IDs of 
cell towers and their associated received signal strength 
(RSSI) values. In our implementation, the Android OS 
gives us the cell ID and the RSSI of up to 6 neighbor- 
ing towers in addition to the associated cell tower. Each 
RSSI value is an integer on a scale from 0 to 31 (higher 
means higher signal-to-noise ratio). 

2. If available, time series signals from accelerome- 
ter, compass, and gyroscope sampled at 20 Hz or higher. 
These are converted to “sensor hints” using on-phone 
processing as explained below. 

3. A known map database that contains the geogra- 
phy of all road segments in the area of interest, such as 
OpenStreetMaps [22], NAVTEQ, or TeleAtlas. 

The output is the likely sequence of road segments tra- 
versed, one for each time instant in the input. 

Figure 4 shows the components of the algorithm. 
Training builds a training database, which maps ground 
truth locations from GPS to observed cell towers and 
their RSSI values. Grid Sequencing uses a Hidden 
Markov Model (HMM) to determine a sequence of spa- 
tial grid cells corresponding to an input sequence of 
GSM fingerprints. The output of grid sequencing is 
smoothed, interpolated, and fed to Segment Matching, 
which matches grid cells to a road map using a differ- 
ent HMM. 

Figure 5 illustrates our algorithm by example. The in- 
put “raw points” in Figure 5(a) are shown only to illus- 
trate the extent of noise in the input data. They are not ac- 
tually used by CTrack. They are computed by using the 
Placelab fingerprinting algorithm [8], where a cell tower 


fingerprint is assigned a location equal to the centroid 
of the closest k fingerprints in the training database (we 
used k = 4). 

Next, we describe each stage of the algorithm. 


4.2 Training 

We divide the geographic area of interest into uniform 
square grid cells of fixed size $g_s$. We associate with each
cell an ordered pair of non-negative integers (x,y), where
(0,0) represents the south-west corner of the area of
interest. We use $g_s$ = 125 meters, chosen to balance
running time, which increases with smaller grid size, against 
accuracy. 

We train CTrack for the area of interest using software 
on mobile phones that logs a timestamped sequence of 
ground truth GPS locations and associated cell tower
fingerprints. For each grid cell $G$ in the road map, our
training database stores $F_G$, the set of distinct
fingerprints seen 
from G. Training can be done out-of-band using an ap- 
proach similar to the Skyhook [29] fleet. Once the train- 
ing database is built, it can be used to map-match or 
track any drive, and needs to be updated relatively in- 
frequently. We can also collect new training data in-band 
from consenting participating phones that use the CTrack 
web service whenever the user has enabled GPS. 
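
A minimal sketch of such a training database follows. It assumes that GPS positions have already been projected to meters east and north of the south-west corner (the projection is omitted) and that a fingerprint is a dictionary from cell tower ID to RSSI; the function names are hypothetical.

```python
from collections import defaultdict

GRID_SIZE_M = 125.0  # grid cell size used in the text

def grid_cell(east_m: float, north_m: float) -> tuple[int, int]:
    """Map an offset in meters from the south-west corner to cell (x, y)."""
    return (int(east_m // GRID_SIZE_M), int(north_m // GRID_SIZE_M))

def build_training_db(samples):
    """Build {grid cell: set of distinct fingerprints seen from that cell}.

    samples: iterable of (east_m, north_m, fingerprint), where fingerprint
    is a dict {cell_tower_id: rssi}.
    """
    db = defaultdict(set)
    for east_m, north_m, fingerprint in samples:
        # A frozenset of items is hashable, so repeated fingerprints collapse.
        db[grid_cell(east_m, north_m)].add(frozenset(fingerprint.items()))
    return db

samples = [
    (10.0, 10.0, {1: 20}),
    (100.0, 50.0, {1: 20}),          # same cell, same fingerprint
    (130.0, 10.0, {1: 18, 2: 7}),    # neighboring cell to the east
]
db = build_training_db(samples)
print({cell: len(fps) for cell, fps in db.items()})
```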


4.3 Grid Sequencing 

Grid sequencing uses a Hidden Markov Model (HMM) 
to determine the sequence of grid cells corresponding 
to a timestamped sequence of cellular fingerprints. An 
HMM is a discrete-time Markov process with a set of 
hidden states and observables. Each state emits an ob- 
servable, whose likelihood is given by an emission score. 
An HMM also permits transitions among its hidden 
states at each time step. These transitions are governed 
by a different set of likelihoods called transition scores. 

In our (first) HMM, the hidden states are grid cells 
and the observables are GSM fingerprints. The emission 
score, $E(G, F)$, captures the likelihood of observing
fingerprint $F$ in cell $G$. The transition score,
$T(G_1, G_2)$, captures the likelihood of transitioning
from cell $G_1$ to $G_2$ in 
a single time step. 

We first process the input GSM fingerprints using 
the windowing technique described below. We then use 
Viterbi decoding [35] to find the maximum likelihood 
sequence of grid cells corresponding to the windowed 
version of the input sequence. The maximum likelihood 
sequence is defined to be the sequence that maximizes 
the product of emission and transition scores. 
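
To make the decoding step concrete, here is a compact, generic Viterbi sketch. It is not CTrack's implementation: the `candidates`, `emission`, and `transition` callables are placeholders supplied by the caller, scores are assumed to be strictly positive, and logarithms are used so that long sequences do not underflow.

```python
import math

def viterbi(observations, candidates, emission, transition):
    """Return the state sequence maximizing the product of emission and
    transition scores.

    candidates(obs) -> iterable of states considered for that observation
    emission(state, obs) and transition(prev_state, state) -> score > 0
    """
    # best[s] = (log-score of the best path ending in s, that path)
    best = {s: (math.log(emission(s, observations[0])), [s])
            for s in candidates(observations[0])}
    for obs in observations[1:]:
        nxt = {}
        for s in candidates(obs):
            # Pick the predecessor that gives the best score on arrival at s.
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + math.log(transition(kv[0], s)))
            nxt[s] = (score + math.log(transition(prev, s))
                      + math.log(emission(s, obs)), path + [s])
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]
```

In CTrack's first pass the states would be grid cells and the observations windowed fingerprints; the same routine could serve the second pass with road segments as states.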

We now describe the four parts of this HMM: window- 
ing, hidden states, emission score, and transition score. 


Windowing. Because it is common for a single cell 
tower scan to miss some of the towers near the current 
location, we group the fingerprints into windows rather 
than use the raw fingerprints captured once per second.
We aggregate the fingerprints seen over $W_{scan}$ seconds
of scanning. We chose $W_{scan}$ = 5 seconds empirically: 
the phone typically sees all nearby cell towers within 3 
scans, which takes about 5 seconds. In our evaluation, we 
show that windowing improves accuracy (Table 1). 
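
Windowing can be sketched as follows. The text does not say how readings of the same tower within a window are combined, so this example assumes a union of towers with averaged RSSI; the function name is hypothetical.

```python
from collections import defaultdict

W_SCAN = 5  # window length in seconds, as chosen in the text

def window_fingerprints(scans, w_scan=W_SCAN):
    """Merge per-second scans into one fingerprint per window.

    scans: list of (timestamp_s, {tower_id: rssi}).
    Assumption: towers are merged by union, averaging the RSSI of a tower
    that appears in several scans of the same window.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for timestamp, fingerprint in scans:
        for tower, rssi in fingerprint.items():
            buckets[int(timestamp // w_scan)][tower].append(rssi)
    return [
        {tower: sum(vals) / len(vals) for tower, vals in buckets[w].items()}
        for w in sorted(buckets)
    ]

scans = [(0, {1: 10}), (1, {1: 12, 2: 5}), (5, {3: 7})]
print(window_fingerprints(scans))
```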


Hidden States. The hidden states of our HMM are grid 
cells. Given an observed fingerprint $F$, a grid cell $G$ is a
candidate hidden state for $F$ if there is at least one
training fingerprint in $G$ that has at least one cell tower
in common with $F$. Note that we might sometimes omit a
valid possible hidden state $G$ if the training data for $G$
is sparse. To overcome this problem, we use a simple
wireless propagation model to predict the set of cell towers
seen from cells that contain no training data. The model 
computes the centroid and diameter of the set of all ge- 
ographic locations from which each cell tower is seen in 
the training data. The model draws a “virtual circle” with 
this center and diameter and assumes that all cells in the 
circle see the tower in question. 
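
One way to realize this "virtual circle" is sketched below. Two details are assumptions rather than statements from the text: the diameter is taken to be the largest pairwise distance between observation points, and a grid cell counts as inside the circle when its center does. Coordinates are meters from the south-west corner.

```python
import math
from itertools import combinations

def virtual_circle(points):
    """points: list of (x_m, y_m) locations where one tower was observed.
    Returns (centroid, diameter)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    diameter = max((math.dist(p, q) for p, q in combinations(points, 2)),
                   default=0.0)
    return (cx, cy), diameter

def cells_in_circle(center, diameter, grid_size=125.0):
    """Grid cells whose center lies within the circle."""
    radius = diameter / 2
    cx, cy = center
    lo_x, hi_x = int((cx - radius) // grid_size), int((cx + radius) // grid_size)
    lo_y, hi_y = int((cy - radius) // grid_size), int((cy + radius) // grid_size)
    return {
        (gx, gy)
        for gx in range(lo_x, hi_x + 1)
        for gy in range(lo_y, hi_y + 1)
        if math.dist(((gx + 0.5) * grid_size, (gy + 0.5) * grid_size),
                     center) <= radius
    }

center, diameter = virtual_circle([(0.0, 0.0), (500.0, 0.0)])
cells = cells_in_circle(center, diameter)
```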
Emission Score. Our emission score $E(G, F)$ is intended
to be proportional to the likelihood that a fingerprint $F$ is
observed from grid cell $G$. A larger emission score means
that a cell is a more likely match for the observed
fingerprint. Our emission score uses the following heuristic.
We find $F_c$, the closest fingerprint to $F$ seen in training
data for $G$. “Closest” is defined to be the value of $F_c$ that
maximizes a pairwise emission score $E_P(F, F_c)$. Our
pairwise score is inspired by RADAR [3]. It captures both
the number of matching cell IDs, $M$, between two
fingerprints, and the Euclidean distance $d_R$ between the
signal strength vectors of the matching towers:

$$E_P(F_1, F_2) = M \lambda_{match} + \left(d_R^{\max} - d_R(F_1, F_2)\right) \qquad (1)$$

where $\lambda_{match}$ is a weighting parameter and
$d_R^{\max} = 32$ is the maximum possible RSSI distance. A
higher number of matching towers, and a lower value of $d_R$,
both correspond to a higher emission score. The maximum value
of the pairwise emission score is normalized (described
below) and assigned as the emission score for $F$.

As an example, consider the fingerprints {(ID=1,
RSSI=3), (ID=2, RSSI=5)} and {(ID=1, RSSI=6),
(ID=2, RSSI=4), (ID=3, RSSI=10)}. The pairwise score
between them would be
$2\lambda_{match} + \left(32 - \sqrt{(3-6)^2 + (5-4)^2}\right)$.

The weighting parameter affects how much weight is
given to tower matches versus signal-strength matches:
we chose $\lambda_{match} = 3$.

We normalize all our emission scores to the range
(0,1) to ensure that they are in the same range as
transition scores, which we discuss next.
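
The pairwise score of Equation (1) and the worked example can be written out as below. Fingerprints are represented as dictionaries from cell tower ID to RSSI, which is an assumed encoding, and the normalization step is left out because its exact form is not given in this section.

```python
import math

LAMBDA_MATCH = 3  # weighting parameter chosen in the text
D_MAX = 32        # maximum possible RSSI distance

def pairwise_score(f1, f2, lam=LAMBDA_MATCH, d_max=D_MAX):
    """Equation (1): matching-tower count plus inverted RSSI distance."""
    common = f1.keys() & f2.keys()
    d_r = math.sqrt(sum((f1[t] - f2[t]) ** 2 for t in common))
    return len(common) * lam + (d_max - d_r)

def closest_score(f, training_fingerprints):
    """Unnormalized emission score: the best pairwise score against the
    fingerprints stored for one grid cell."""
    return max(pairwise_score(f, fc) for fc in training_fingerprints)

f1 = {1: 3, 2: 5}
f2 = {1: 6, 2: 4, 3: 10}
print(pairwise_score(f1, f2))  # 2*3 + (32 - sqrt(10))
```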
Figure 5: CTrack map-matching pipeline. Black lines are
ground truth and red points/lines are obtained from
cellular fingerprints.

where $d(G_1, G_2)$ is the Manhattan distance between grid
cells $G_1$ and $G_2$ represented as ordered pairs
$(x_1, y_1)$ and $(x_2, y_2)$. The transition score is based
on the intuition 
that, between successive time instants, the user either 
stayed in the same cell or moved to an adjacent cell. It 
is unlikely that jumps between non-adjacent cells occur, 
but we permit them with a small probability to handle 
gaps in input data. 

Figure 5(b) shows the output of the grid sequencing 
step for our running example. As we can see, sequenc- 
ing removes a significant amount of noise from the input 
data. In our evaluation, we demonstrate that the sequenc- 
ing step is critical (Figure 11). 


4.4 Smoothing and Interpolation 


This component takes a grid sequence as input and converts
it into a sequence of (lat, lon) coordinates that are
then processed by the Segment Matching stage.


Smoothing filter. For each grid in the sequence, we cal- 
culate the centroid of the training points seen from the 
grid. The centroid has the following advantage: if there 
is only one road segment in a grid (a frequent occurrence) 
and the training points lie on it, so will the centroid. 
Typically, centroids from grid sequencing have high fre- 
quency noise in the form of back-and-forth transitions 
between grids (Figure 5(b)). Hence, we apply a smooth- 
ing low-pass filter with a sliding window of size Wsmooth 
to the centroids calculated as described above. The fil- 
ter computes and returns the centroid of centroids in 
each window. This filter helps us to accurately deter- 
mine the overall direction of movement and filter out the 
high frequency noise. We chose the filter window size, 
Wsmooth = 10, empirically. 
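A minimal sketch of the "centroid of centroids" low-pass filter follows. The text does not say how the window is aligned or how the ends of the sequence are handled, so the roughly centered window, truncated at the boundaries, is an assumption, and the name `smooth_centroids` is hypothetical.

```python
def smooth_centroids(centroids, window=10):
    """Replace each grid centroid by the centroid of the centroids in its window.

    centroids: list of (lat, lon) tuples, one per grid in the sequence.
    The window placement (starting window // 2 points before the current
    one, clipped at both ends) is an assumption of this sketch.
    """
    smoothed = []
    half = window // 2
    for i in range(len(centroids)):
        lo = max(0, i - half)
        hi = min(len(centroids), lo + window)
        pts = centroids[lo:hi]
        lat = sum(p[0] for p in pts) / len(pts)
        lon = sum(p[1] for p in pts) / len(pts)
        smoothed.append((lat, lon))
    return smoothed


# Back-and-forth transitions between two grids collapse to their midpoint.
zigzag = [(0.0, 0.0), (0.0, 1.0)] * 5
flat = smooth_centroids(zigzag, window=2)
```

The example shows the intended effect: high-frequency oscillation between two neighbouring grids is averaged out, leaving only the overall direction of movement.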

Interpolation. Earlier, we windowed the input trace and
grouped cellular scans over a longer window of $W_{scan}$
seconds. As a result, the smoothing filter produces only
one point every $W_{scan}$ seconds. We linearly interpolate
these points to obtain points sampled at a 1-second interval,
and pass them as input to the Segment Matching step
described in §4.5.
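The interpolation step can be illustrated as follows. This sketch assumes the window length is a whole number of seconds and interpolates latitude and longitude independently; the function name `interpolate` is hypothetical.

```python
def interpolate(points, w_scan):
    """Linearly interpolate (lat, lon) points spaced w_scan seconds apart
    into points spaced one second apart (w_scan assumed to be a positive int)."""
    if not points:
        return []
    out = []
    for (lat0, lon0), (lat1, lon1) in zip(points, points[1:]):
        for s in range(w_scan):
            t = s / w_scan   # fraction of the way from one point to the next
            out.append((lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)))
    out.append(points[-1])   # keep the final smoothed point
    return out


dense = interpolate([(0.0, 0.0), (4.0, 8.0)], 4)
```

Two smoothed points four seconds apart become five points, one per second, including both endpoints.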

The reason for interpolation is that segment match- 
ing produces a continuous trajectory where each seg- 
ment is mapped to at least one input point. The mini- 
mum frequency of input to the segment matcher is one 
that ensures that even the smallest segment has at least 
one point. The smallest segment in the OpenStreetMaps 
and NAVTEQ maps is roughly 30 meters; so assuming a 
maximum speed of 65 MPH = 105 km/h = 29 m/s, we 
need about once-a-second sampling or higher to ensure 
this condition. Higher speeds than that generally occur 
on freeways where segments are usually longer than 30 
meters. 
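The sampling-rate argument above is simple arithmetic, reproduced here as a check. The variable names are illustrative only.

```python
MPH_TO_MPS = 1609.344 / 3600.0        # metres per mile / seconds per hour

speed_mps = 65 * MPH_TO_MPS           # maximum assumed speed, about 29 m/s
shortest_segment_m = 30.0             # smallest map segment quoted in the text

# Longest sampling interval that still places a point on every segment.
max_interval_s = shortest_segment_m / speed_mps
```

The interval comes out at just over one second, which is why interpolating to one point per second is about the coarsest sampling that satisfies the condition.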

Figure 5(c) shows the example drive after smoothing
and interpolation. This output is free of back-and-forth
transitions and correctly fixes the direction of travel at
each time instant. Our evaluation quantifies the benefit
of smoothing (Table 1).


4.5 Segment Matching 


Segment Matching maps sequenced, smoothed grids 
from the previous stages to road segments on a map. It 
takes as input the sequence of points from the Smoothing 
and Interpolation phase, and turn and movement hints 
from the phone, to determine the most likely sequence 
of segments traversed. We describe how movement and 
turn hints are extracted in Section 5. 

For segment matching, we use a version of the VTrack 
algorithm [32] augmented to process sensor hints. This 
step also uses an HMM. In this case, the states are the 
set of possible triplets $\{S, H_M, H_T\}$, where S is a road
segment, $H_M \in \{0,1\}$ is the current movement hint, and
$H_T \in \{0,1\}$ is the current turn hint.

The emission score of a point $(lat, lon, H_M, H_T)$ from
a state $(S, H'_M, H'_T)$ is zero if $H_M \neq H'_M$ or $H_T \neq H'_T$.
Otherwise, we make it Gaussian, with the form $e^{-D^2}$, where
D is the distance of (lat, lon) from road segment S.

The transition score between two triplets $\{S^1, H^1_M, H^1_T\}$
and $\{S^2, H^2_M, H^2_T\}$ is defined as follows. It is 0 if segments
$S^1$ and $S^2$ are not adjacent, disallowing a transition between
them. This restriction ensures that the output of
matching is a continuous trajectory. For all other cases,
the base transition score is 1. We multiply this score
by a movement penalty, $\lambda_{movement}$ ($0 < \lambda_{movement} < 1$),
if $H^1_M = H^2_M = 0$ and $S^1 \neq S^2$, to penalize transitions
to a different road when the device is not moving. We
also multiply by a turn penalty, $\lambda_{turn}$ ($0 < \lambda_{turn} < 1$), if
the transition represents a turn, but the sensor hints report
no turn. We used $\lambda_{movement} = 0.1$ and $\lambda_{turn} = 0.1$.
Our algorithm is not very sensitive to these values, since
the penalties are multiplied together and a small enough
value suffices to correct incorrect turn/movement patterns.
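The transition rules above translate directly into a small function. This is a sketch under stated assumptions rather than the authors' code: the callbacks `adjacent` and `is_turn` are hypothetical stand-ins for map queries, staying on the same segment is assumed to be allowed, and the turn hint is read from the destination state because the text does not say which state's hint applies.

```python
LAMBDA_MOVEMENT = 0.1   # movement penalty quoted in the text
LAMBDA_TURN = 0.1       # turn penalty quoted in the text


def transition_score(state1, state2, adjacent, is_turn,
                     lam_move=LAMBDA_MOVEMENT, lam_turn=LAMBDA_TURN):
    """Score a transition between two (segment, movement_hint, turn_hint) states."""
    s1, hm1, _ht1 = state1
    s2, hm2, ht2 = state2
    if s1 != s2 and not adjacent(s1, s2):
        return 0.0                  # non-adjacent segments: disallowed
    score = 1.0                     # base score for every allowed transition
    if hm1 == 0 and hm2 == 0 and s1 != s2:
        score *= lam_move           # changed road while reported static
    if is_turn(s1, s2) and ht2 == 0:
        score *= lam_turn           # geometric turn without a turn hint
    return score


# Toy map (hypothetical): A-B is straight, B-C is a turn, A-C is not connected.
EDGES = {("A", "B"), ("B", "C")}
TURNS = {("B", "C")}


def adjacent(a, b):
    return (a, b) in EDGES


def is_turn(a, b):
    return (a, b) in TURNS
```

For example, `transition_score(("B", 0, 0), ("C", 0, 0), adjacent, is_turn)` applies both penalties and yields about 0.01, while the same turn with both hints set yields 1.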

Similar to VTrack, the HMM also includes a speed
constraint that disallows transitions out of a segment if 
sufficient time has not been spent on that segment. The 
maximum permitted speed can be calibrated depending 
on whether we are tracking a user on foot or in a vehicle. 

The output of the segment matching stage is a set of
segments, one per fingerprint in the interpolated trace
(which, on average, has the same periodicity as the original
input). The output for the running example is shown
in Figure 5(d).

When running online as part of the CTrack web ser- 
vice, the segment matcher takes turn hints and sequenced 
grids as input in each iteration and returns the current 
segment to an application querying the web service. 


Running time. The run-time complexity of the entire
algorithm, including all stages, is O(mn), where m is
the number of input fingerprints and n is the number of
search states (the larger of the number of grid cells and
road segments on the map). Our Java implementation on
a MacBook Pro with 2.33 GHz CPU and 3 GB RAM
map-matched an hour-long trace in approximately two
minutes, approximately 30 times faster than real time. It
is straightforward to reduce the run time by more aggressively
pruning the search space, but we have not found
the need to do so yet.

Figure 6: Movement hint extraction from accelerometer.


5 SENSOR HINT EXTRACTION 


CTrack includes a sensor hint extraction layer that pro- 
cesses raw phone accelerometer readings to infer infor- 
mation about whether the phone being tracked is moving 
or not, and processes orientation sensor readings from a 
compass or a gyroscope to heuristically infer vehicular 
turns. These hints are transmitted along with the GSM 
fingerprint to the server for map matching. 


Anomaly detection. Anomaly detection filters out pe- 
riods when the user is lifting the phone, speaking on 
the phone, texting, waving the phone about, or other- 
wise using the phone. We want to use accelerometer and 
compass/gyro data only in periods where we have high 
confidence that the phone is more or less at rest rela- 
tive to the moving object in which it is located (e.g., on 
a flat surface or in a user’s pocket). We found empirically
that when driving with the phone at rest in a vehicle
or in a pocket, the raw accelerometer magnitude
tends to be smaller than 14 m/s^2. Hence, we look for
spikes in the raw accelerometer magnitude that exceed
a threshold of 14 m/s^2. Whenever we encounter such a
spike, we ignore all accelerometer and compass data in 
the map-matching algorithm until the phone comes back 
to a state of rest (this can be detected using standard de- 
viation of acceleration, as explained below). On more re- 
cent phones such as the iPhone 4, the in-built gyroscope 
gives the exact orientation of the phone which can be di- 
rectly read to determine if the phone is on a flat surface/in 
a user’s pocket. 


Having filtered out anomalous periods, the hint extraction
processes stable periods to extract movement and
turn hints, as explained below.

Figure 7: Turn hint extraction from compass.


Movement Hints. Our algorithm uses accelerometer 
data sampled at 20 Hz. We extract a simple “static”
or “moving” hint (1 bit) rather than integrating the
accelerometer data to compute velocities or processing it
in a more complex way, because accelerometer data is 
noisy and hard to integrate accurately without accumu- 
lating drift. In contrast, it is easy to detect movement 
with an accelerometer: within a stable (spike-free) pe- 
riod, the accelerometer shows a significantly higher vari- 
ance while moving than when stationary. 


Accordingly, we compute a boolean (true/false) move- 
ment hint for each time slot. We divide the data into one- 
second slots and compute the standard deviation of the 
3-axis magnitude of the acceleration in each slot. Di- 
rectly thresholding standard deviation sometimes results 
in spurious detections when the vehicle is static and the 
signal exhibits a short-lived outlier. To fix this, we apply
an EWMA filter to the standard deviation stream to
remove short-lived outliers. We then apply a threshold,
$\sigma_{movement}$, on the standard deviation to label each time
slot as “static” or “moving”. We used a subset of our driving
data across multiple phones as training (where we do
know ground truth from GPS), to learn the optimal value
of $\sigma_{movement}$, which turned out to be approximately 0.15
m/s^2 for one-second windows. Figure 6 illustrates our
movement hint extraction algorithm on example data.
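The movement-hint pipeline (per-second standard deviation, EWMA smoothing, threshold) can be sketched as below. The EWMA weight `alpha` is not given in the text and is a hypothetical value chosen for illustration, as is the function name `movement_hints`; the 20 Hz rate and the 0.15 threshold come from the text.

```python
import math
import statistics


def movement_hints(samples, rate_hz=20, threshold=0.15, alpha=0.3):
    """Return one 0/1 movement hint per one-second slot.

    samples: list of (ax, ay, az) accelerometer readings in m/s^2,
    assumed to come from a stable (spike-free) period.
    alpha: EWMA weight; an assumption, not a value from the paper.
    """
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    hints = []
    ewma = None
    for start in range(0, len(mags) - rate_hz + 1, rate_hz):
        sd = statistics.pstdev(mags[start:start + rate_hz])
        # Smooth the per-slot deviation so a single noisy slot is damped.
        ewma = sd if ewma is None else alpha * sd + (1 - alpha) * ewma
        hints.append(1 if ewma > threshold else 0)
    return hints


static = [(0.0, 0.0, 9.8)] * 20
outlier = [(0.0, 0.0, 9.8)] * 19 + [(0.0, 0.0, 11.6)]   # one short-lived spike
moving = [(0.0, 0.0, 8.8), (0.0, 0.0, 10.8)] * 10       # sustained variation
```

With these inputs, a lone outlier slot between static slots stays labeled static, while sustained variation flips the hint to moving.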


Turn Hints. The orientation sensor of a smartphone 
(compass/gyroscope) provides orientation about three 
axes. We are interested in the axis that provides the rela- 
tive rotation of the phone about an axis parallel to gravity 
(called “yaw” on the iPhone 4). 


Because the phone can be in any orientation to start 
with in a handbag or pocket, we do not use the abso- 
lute orientation in any of our algorithms. We have ob- 
served that irrespective of how the phone is situated, a 
true change in orientation manifests as a persistent, sig- 
nificant, and steep change in the value of the orientation 
sensor. 
Figure 8: Coverage map of our driving data set.


With compass data, the main challenge is that the
orientation reported is noisy because of metallic objects
nearby, or because the compass becomes uncalibrated.
We solved this problem by applying a median filter with 
a three-second window on the raw orientation values, 
which filtered out non-persistent noise with consider- 
able success (a mean filter would also remove noise, but 
would blur sharp transitions that we do want to observe). 
We then find transitions with a magnitude of at
least 20 degrees and slope exceeding a threshold, which
we fixed at 1.5 by experimentation.
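One way to realize the median-filter-then-threshold procedure is sketched below. Several details are assumptions, since the text leaves them open: the slope is taken to be in degrees per second, a transition is taken to be a maximal monotonic run of the filtered heading sampled once per second, and wraparound at 360 degrees is ignored. The names `median_filter` and `turn_hints` are hypothetical.

```python
import statistics


def median_filter(values, window):
    """Centered median filter; windows are clipped at the ends of the list."""
    half = window // 2
    return [statistics.median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def turn_hints(headings, rate_hz=20, min_change=20.0, min_slope=1.5):
    """Return one 0/1 turn hint per second from raw headings in degrees."""
    filtered = median_filter(headings, 3 * rate_hz)   # three-second window
    per_sec = filtered[::rate_hz]                     # one value per second
    hints = [0] * len(per_sec)
    i = 0
    while i < len(per_sec) - 1:
        step = per_sec[i + 1] - per_sec[i]
        if step == 0:
            i += 1
            continue
        sign = 1 if step > 0 else -1
        j = i
        # Extend the run while the heading keeps moving in the same direction.
        while j < len(per_sec) - 1 and (per_sec[j + 1] - per_sec[j]) * sign > 0:
            j += 1
        change = abs(per_sec[j] - per_sec[i])
        slope = change / (j - i)                      # degrees per second (assumed unit)
        if change >= min_change and slope > min_slope:
            for k in range(i + 1, j + 1):
                hints[k] = 1
        i = j
    return hints


# 10 s heading 90, a 5 s turn to 180, then 10 s heading 180, plus one noise spike.
trace = [90.0] * 200 + [90.0 + 0.9 * k for k in range(100)] + [180.0] * 200
trace[60] = 300.0
hints = turn_hints(trace)
```

The isolated spike at sample 60 is removed by the median filter, so only the five seconds of the sustained 90-degree change are flagged as a turn.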

Figure 7 illustrates a plot of the compass data with the 
turn marked, and the processing steps required to gener- 
ate a turn hint. We note that even after filtering, a true 
change in orientation can sometimes be produced by the 
phone sliding around within a pocket or a bag, or turning 
for reasons other than the car actually turning. 


6 EVALUATION 


In this section, we show that the trajectory matches pro- 
duced by CTrack are: (1) accurate enough to be use- 
ful for various tracking and positioning applications, (2) 
superior to sub-sampled GPS in terms of the accuracy- 
energy tradeoff, and (3) significantly better than strate- 
gies that reduce cellular fingerprints to point locations 
before matching. We investigate how much each of the
four techniques used in CTrack (sequencing, windowing,
smoothing, and sensor hints) contributes to the gains
in accuracy.


6.1 Method and Metrics 


We evaluate CTrack on 126 hours of real driving data in 
the Cambridge-Boston area, collected from 15 Android 
G1 phones and one Nexus One phone over a period of 
4 months. We configured our phone library for the An- 
droid OS to continuously log the ground truth GPS loca- 
tion and the cell tower fingerprint every second, and the 
accelerometer and compass at 20 Hz. Our data set covers 
3,747 road segments, amounts to 1,718 km of driving, 
and 560 km of distinct road segments driven. The data 
set includes sightings of 857 distinct cell towers. Fig- 
ure 8 shows a coverage map of the distinct road segments 
driven in our data set. 


From 312 drives in all, we selected a subset of 53 
drives verified manually to have high GPS accuracy as 
test drives, amounting to 109 distinct km. We picked 
a limited subset as test drives to ensure each test drive 
was contained entirely within a small bounding box with 
dense training coverage. This is because evaluating the 
algorithm in areas of sparse coverage (which many of 
the other 259 drives venture into) could bias results in 
our favor by reducing the number of candidate paths to 
map-match to. For each test drive, we perform leave-one-out
evaluation of the map-matching algorithm: we train
our algorithm on all 311 drives excluding the test drive, 
and then map-match the test drive using CTrack. We do 
this to ensure enough training data for each drive, and at 
the same time to keep the evaluation fair. 

We compare CTrack to two other strategies in terms of 
energy and accuracy: 

1. GPS k gets one GPS sample every k min- 
utes (k = 2,4), interpolates, and map-matches it using 
VTrack [32]. 

2. Placelab+VTrack computes the best static localization
estimate for each time instant using Placelab’s
technique [8], and matches the static estimates using 
VTrack [32]. The VTrack paper shows that its HMM 
does much better than just matching each point to the 
nearest segment. 

We use three metrics in our evaluation of accuracy: 
precision, recall, and geographic error. Our precision 
and recall are similar to conventional precision and recall,
but take the order of matched segments in the trajectory
into account. We say that a subset of segments in a
trajectory $T_1$ that also appears in trajectory $T_2$ are aligned
if those segments appear in $T_1$ in the same order in which
they appear in $T_2$. Given a ground truth sequence of segments
G and an output sequence X to evaluate (produced
by one of the algorithms), we run a dynamic program to 
find the maximum length of aligned segments between G 
and X. We define: 

$$\text{Precision} = \frac{\text{Total length of aligned segments}}{\text{Total length of } X} \qquad (2)$$

$$\text{Recall} = \frac{\text{Total length of aligned segments}}{\text{Total length of } G} \qquad (3)$$

We estimated the ground truth sequencing of segments 
by map-matching GPS data sampled every second with 
VTrack [32], and manually fixing a few minor flaws in 
the results. 
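The dynamic program behind Equations (2) and (3) is not spelled out in the text. One natural reading is a longest-common-subsequence recurrence weighted by segment length, sketched below; the names `aligned_length` and `precision_recall` and the `length` lookup table are hypothetical.

```python
def aligned_length(g, x, length):
    """Maximum total length of segments that appear in both g and x
    in the same order (a length-weighted longest common subsequence)."""
    n, m = len(g), len(x)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if g[i - 1] == x[j - 1]:
                best = max(best, dp[i - 1][j - 1] + length[g[i - 1]])
            dp[i][j] = best
    return dp[n][m]


def precision_recall(g, x, length):
    """Precision and recall of output sequence x against ground truth g."""
    aligned = aligned_length(g, x, length)
    total_x = sum(length[s] for s in x)
    total_g = sum(length[s] for s in g)
    return aligned / total_x, aligned / total_g


# Segment lengths in metres (illustrative values only).
seg_len = {"a": 100.0, "b": 50.0, "c": 200.0, "d": 150.0, "e": 100.0}
truth = ["a", "b", "c", "d"]
output = ["a", "c", "e", "d"]
precision, recall = precision_recall(truth, output, seg_len)
```

Here segments a, c and d are aligned (450 m), so precision is 450/550 and recall is 450/500. The table is O(|G| |X|) in time and space.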


Geographic Error. Precision and recall are relevant to 
applications that care about obtaining information at a 
segment-level, such as traffic monitoring. However, ap- 
plications such as visualization do not need to know the 
exact road segments traversed, but may want to identify 
the broad contours of the route followed (e.g., mistaking 
a road for a nearby parallel road may be acceptable). 
Figure 9: CDF of Precision: Comparison.


To quantify this notion, we compute a third metric, ge- 
ographic error, which captures the spatial distance be- 
tween the ground truth and the matched output. We com- 
pute the maximum alignment between the ground truth 
trajectory G and output trajectory X using dynamic pro- 
gramming. This alignment matches each segment S of X 
to either the same segment S on G (if CTrack matched
that segment correctly) or to a segment $S_{wrong} \in G$ (if
matched incorrectly). Define the segment geographic error
to be the distance between S and $S_{wrong}$ for incorrect
segments, and 0 for correctly matched segments. The
mean segment geographic error over all segments in X is 
the overall geographic error. 


6.2 Key Findings 
The key findings of our evaluation are: 

1. CTrack has 75% precision and 80% recall in both 
the mean and median, and a median geographic error of 
44.7 meters. We discuss what these numbers mean in the 
context of real applications below. 

2. CTrack has 2.5× better precision and 3.5× smaller
geographic error than Placelab+VTrack.

3. CTrack is equivalent in precision to map-matching
GPS sub-sampled every 2 minutes while consuming over
2.5× less energy. It also reduces error (1 - precision) by
a factor of over 2 compared to sub-sampling GPS every
4 minutes, consuming a similar amount of energy.
CTrack is 6× better than continuous WiFi sampling in
terms of battery lifetime on the Android platform.

4. The first step of CTrack, grid sequencing, is criti- 
cal. Without sequencing, CTrack effectively reduces to 
computing a (lat, lon) estimate from the best fingerprint
match, ignoring all other data. The median precision
without sequencing is only 50%. See Section 6.4
for more detail. 

5. We can extract movement and turn hints from raw 
sensor data with approximately 75% precision and re- 
call. These hints improve accuracy by removing spurious 
loops and turns in the output. Using hints improves pre- 
cision by 6% and recall by 3%. See Section 6.5 for more 
detail. 


NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation 


Figure 10: CDF of Recall: Comparison. (Curves: CTrack, GPS 2 min, GPS 4 min, Placelab+VTrack; x-axis: Recall (Matched Length/True Length); y-axis: Probability.)


6.3 Accuracy Results 


Figure 9 shows a CDF of the map-matching precision
for CTrack, GPS k (for k = 2,4 minutes) and Placelab+VTrack.
CTrack has a median precision of 75%,
much higher than both the energy-equivalent strategy
of sub-sampling GPS every 4 minutes (48%), and
Placelab+VTrack (42%). In effect, CTrack has over 2×
lower error (1 - precision) than sub-sampling GPS every
4 minutes, and over 2.5× lower error than map-matching
cellular localization estimates output by the
Placelab method. Also, CTrack has equivalent precision
to map-matching GPS sub-sampled every two minutes,
while reducing energy consumption by approximately
2.5× compared to this approach (Figure 2).

Figure 10 shows a CDF of the recall. All the strate- 
gies except GPS 4 min are equivalent in terms of re- 
call. Sub-sampling GPS every four minutes has poor re- 
call (median only 41%) because a four-minute sampling 
interval misses significant turns in our input drives and 
finds the wrong path. The fact that Placelab+VTrack has
identical recall shows that simple static cellular localization
does manage to recover a significant part of the input
drive. However, converting cellular fingerprints directly
to points results in significant noise and long-lived
outliers, and hence produces a large number of incorrect
segments when map-matched directly (i.e., has low precision).

To understand what 75% precision might mean in
terms of an actual application, we refer readers to
our work on VTrack [32], which studies the relation- 
ship between map-matching accuracy and the accuracy 
of two end-to-end applications: traffic delay monitor- 
ing and traffic hot-spot detection. We found that a me- 
dian precision of 85% was still useful for accurate traf- 
fic delay estimation. Our results for cellular (75%) are 
only somewhat worse, and while not directly compara- 
ble, they suggest a significant portion of delay data from 
CTrack would be useful. 

For applications such as route visualization, or those
that aggregate statistics over paths (e.g., to compute
histograms over which of n possible routes is taken), or
those that simply show a user’s location on a map, getting
most segments right with a low overall error is likely
sufficient. Our median geographic error is quite low (just
45 meters), suggesting CTrack would have sufficient
accuracy for such applications. In contrast, the median
geographic error of the Placelab+VTrack approach is 156
meters, over 3.5× worse than CTrack.

Figure 11: Precision with and without grid sequencing. (Curves: With Sequencing, Without Sequencing; x-axis: Precision (Matched Length/Output Length); y-axis: Probability.)


Filtering using a confidence predictor. We investigated 
whether a confidence metric could be used to filter out 
drives on which CTrack does poorly, thereby trading
off some recall for substantially better precision, which
would be useful for accuracy-sensitive applications. We 
found two predictors, both weakly correlated with map- 
matching accuracy: (a) the 90th percentile distance of 
smoothed grids from the segments they are matched to, 
and (b) the mean difference (over all points P) in emis- 
sion score between the segment that P is matched to 
in the output, and the segment closest to P. The intu- 
ition is that a point far away from the road segment it is 
matched to, or closer to a different road segment, implies 
lower confidence in the match. When applying these confidence
filters to our output drives, we currently improve
the median precision from 75% to 86%, but lose substantially
in terms of recall, whose median reduces from
80% to 35%. In future work, we plan to explore whether
boosting [12] can combine these weak confidence pre- 
dictors into a stronger one. 


6.4 Benefit of Sequencing 


We elaborate on one of our key technical contributions: 
the idea that the first pass of grid sequencing before con- 
verting fingerprints to geographic locations is crucial to 
achieving good matching accuracy. We provide experi- 
mental evidence supporting this idea. We also show that 
windowing and smoothing help improve matching accu- 
racy, though to a lower degree. 


Impact of Sequencing. Figure 11 is a CDF that compares
the precision of CTrack with and without the first
pass of grid sequencing. This figure shows that sequencing
is critical to achieving reasonable accuracy: without
sequencing, the median precision drops from 75%
to 50%. The reason is that running CTrack without sequencing
amounts to reducing each fingerprint to its best
match in the training database, ignoring the sequence of
points.

Figure 12: Geographic spread of exact matches. The
dashed line shows the 80th percentile.

As mentioned earlier, reducing a fingerprint to a single 
geographic location loses information because a given 
cellular fingerprint is seen from multiple locations quite 
far apart. Figure 12 illustrates the CDF of this geographic 
spread. We selected 1000 fingerprints at random from our 
training data. For each fingerprint F, we found all the
exact matches for F, i.e., fingerprints F' with the exact
same set of towers in the training data as F. We ordered
the matches by similarity in signal strength, most similar 
first, and computed the geographic diameter of the top k 
matches for each fingerprint (using k = 4). 
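The geographic-spread measurement can be sketched as follows. Assumptions of this sketch: a fingerprint is a dictionary from cell ID to RSSI, similarity in signal strength is Euclidean distance over RSSI values, the "diameter" is the largest pairwise distance among the top k matches, and distances use the haversine formula with a mean Earth radius of 6371 km. The function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371000.0   # mean Earth radius (assumed constant)


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def top_k_diameter(query, training, k=4):
    """Diameter of the locations of the k most similar exact matches.

    query: {cell_id: rssi}; training: list of ({cell_id: rssi}, (lat, lon)).
    """
    # Exact matches share exactly the same set of towers as the query.
    exact = [(fp, loc) for fp, loc in training if set(fp) == set(query)]
    # Most similar in signal strength first.
    exact.sort(key=lambda item: math.sqrt(
        sum((item[0][c] - query[c]) ** 2 for c in query)))
    locs = [loc for _, loc in exact[:k]]
    return max((haversine_m(a, b) for a, b in combinations(locs, 2)),
               default=0.0)


query = {1: 5, 2: 7}
training = [
    ({1: 5, 2: 7}, (0.0, 0.0)),
    ({1: 6, 2: 7}, (0.0, 0.001)),
    ({1: 9, 2: 9}, (0.0, 0.003)),
    ({1: 5, 3: 7}, (1.0, 1.0)),    # different tower set, so not an exact match
]
d2 = top_k_diameter(query, training, k=2)
d4 = top_k_diameter(query, training, k=4)
```

In this toy data the two most similar matches lie about 111 m apart, while including the third exact match stretches the diameter to roughly 334 m.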

The figure shows that over 20% of matching sets have 
a diameter exceeding 150 meters, and at least 10% have 
a diameter exceeding 400 meters. Recall that the meth- 
ods in Placelab (and RADAR, if applied to cellular data) 
would simply compute the centroid of the top k matches. 
This approach does not work well for sets with a large 
geographic spread, and motivates the need for the funda- 
mentally different approach used in CTrack in which we 
keep track of all possible likely locations, and then use a 
continuity constraint to sequence these locations in two 
steps.


Windowing and Smoothing. Table 1 shows the precision
and recall of CTrack with and without windowing
and smoothing, two other heuristics used in CTrack. We 
see that each of these features improves the precision by 
approximately 10%, which is a noticeable quantity. The 
recall does not improve because the algorithm without 
windowing/smoothing is good enough to identify most 
of the segments driven: the heuristics mainly help elimi- 
nate loops in the output. 


6.5 Do Sensor Hints Help? 

Figure 13 illustrates by example how turn hints extracted
from the phone compass help in trajectory matching.
Without using turn hints (Figure 13(a)), our algorithm
finds the overall path quite accurately but includes several
spurious turns and kinks, owing to errors in cellular
localization. After including turn hints in the HMM, the
false turns and kinks disappear (Figure 13(b)).

                 With              Without
             Prec.   Recall    Prec.   Recall
Windowing    75.4%   80.3%     65.6%   82.3%
Smoothing    75.4%   80.3%     66.5%   82.5%

Table 1: Windowing and smoothing improve median trajectory
matching precision.








Figure 13: Sensor hints from the compass and accelerometer
aid map-matching. Red points show ground truth and
the black line is the matched trajectory. (Panel titles
recovered: (c) Without movement hints, (d) With movement hints.)


In Figure 13(c), the driver stopped at a gas station to 
refuel, which can be seen from the cluster of ground-truth 
GPS points. Before using movement hints, errors from 
cellular localization were spread out, causing the map- 
matching to introduce a loop not present in the ground 
truth (Figure 13(c)). After incorporating movement hints, 
the speed constraint in our HMM eliminates this loop be- 
cause it detects that the car would not have had sufficient 
time to complete the loop (Figure 13(d)). We note a lim- 
itation of the movement hint: this kind of stop detection 
works because the phone was placed on the dashboard: 
if it had been in the driver’s pocket during refueling, the 
movement hints would not have helped had the driver 
gotten out of the car and been moving about. 

Figure 14 is a CDF that compares the precision of
CTrack with and without sensor hints (both movement
and turn). This figure shows that sensor hints improve
the median precision of matching by approximately 6%.
While this may not seem huge, there exist several trajectories
for which the hints do help significantly, suggesting
that using them is a good idea when available. In our
experience, the main benefit of the hints is in eliminating
the several “kinks” and spurious turns in the matched
trajectory, which our metrics don’t adequately capture.

Figure 15: Precision/Recall CDF For Hint Extraction.

We used the ground truth GPS to measure how accurately
CTrack is able to extract individual movement
and turn hints. We found that the median precision and
recall of both motion and turn hint extraction exceed
75%.


6.6 How Much Training? 

To quantify the amount of training data essential 
to achieving good trajectory mapping accuracy with 
CTrack, we picked a pool of test drives at random, 
amounting to 5% of our data set (8 hours of data), and 
designated the remaining 95% as the training pool. We 
picked subsets of the training pool of increasing size, i.e.,
first using fewer drives for training, then using more. In 
each run, the training subset was used to train CTrack 
and then evaluated on the test pool. Figure 16 shows the 
mean precision and recall of CTrack on the test pool as 
a function of the number of drive hours of training data 
used to train the system. The accuracy is poor for very 
small training pools, as expected, but encouragingly, it 
quickly increases as more training data is available. The 
algorithm performs almost as accurately with 40 hours 
of training data as with 120, suggesting that 40 hours of 
training is sufficient for our data set. 
Figure 16: Prec./Recall vs Training Data Size.
Figure 17: CDF of Drive Counts, 40 Hrs Training Data.


The 40-hour number, of course, is specific to the ge- 
ographic area we covered in and around Boston, and to 
the test pool. To gain more general insight, we measure 
the drive count for each road segment in the test pool, de- 
fined as the number of times the segment is traversed by 
any drive in the training pool. Figure 17 shows the dis- 
tribution of test segment drive counts corresponding to 
40 hours of training data. While the mean drive count is
approximately 3, this does not mean each road segment 
on the map needs to be driven thrice to achieve good ac- 
curacy. As the graph shows, about 60% of the test seg- 
ments were not traversed even once in the training pool, 
but we can still map-match many of these segments cor- 
rectly. The reason is that they lie in the same grid cell 
as some nearby segment that was driven in the training 
pool. This result is promising because it suggests that training
does not have to cover every road segment on the map
to achieve acceptable accuracy. 


7 RELATED WORK 


Placelab performed a comprehensive study of GSM lo- 
calization and used a fingerprinting scheme for cellular 
localization [8]. RADAR used a similar fingerprinting 
heuristic for indoor WiFi localizations [3], and our map- 
matching emission score is inspired by these methods. 
However, neither Placelab nor RADAR addresses the problem
of trajectory matching; both are concerned with the
accuracy of individual localization estimates, rather than
finding the optimal sequencing of estimates. As shown
by our results, this sequencing step is critical: applying a
map-matching algorithm directly to Placelab-style location
estimates results in significantly worse accuracy (by
a factor of over 2×) compared to CTrack.

Letchner et al. [17] and our previous work on 
VTrack [32] use HMMs for map-matching. However, 
these previous algorithms use and process (lat,lon) co- 
ordinates as input and use a Gaussian noise model for 
emissions, and are hence unsuitable and inaccurate for 
map-matching cellular fingerprints, as shown by our re- 
sults. Nor do they use sensor hints. 

CompAcc [10] proposes to use smartphone compasses 
and accelerometers to find the best match for a walking 
trail by computing directional “path signatures” for these 
trails. They do not use cell towers. However, from our 
understanding, the paper uses absolute values of compass 
readings. This approach did not work in our experiments, 
because the absolute orientation of a phone can be quite 
different depending on whether it is in a driver’s pocket, 
on a flat surface, or held in a person’s hand. For this rea- 
son, we chose to use boolean turn hints instead, which 
are more robust and can be accurately computed regard- 
less of changes in the phone’s initial orientation or posi- 
tion. For extracting motion hints and detecting walking 
and driving using the accelerometer, we use algorithms 
similar to those in [27, 31, 26]. 

Some previous papers [9, 23, 16] have proposed 
energy-efficient localization schemes that reduce re- 
liance on continuously sampling GPS by using a more 
energy-efficient sensor, such as the accelerometer, to 
trigger sampling GPS. RAPS [23] also uses cell towers to 
“blacklist” areas where GPS accuracy is low and hence 
GPS should be switched off, to save energy. However, 
none of these papers address trajectory matching or pro- 
pose a GPS-free, accurate solution for map-matching. 

Skyhook [29] and Navizon [20] are two commercial 
providers for WiFi and Cellular localization, providing 
databases and APIs that allow programmers to submit 
WiFi access point(s) or cell tower(s) and look up the 
nearest location. However, to the best of our knowl- 
edge, they do not use any form of sequencing or map- 
matching, and focus on providing the best static local- 
ization estimate. 


8 CONCLUSION


We described CTrack, an energy-efficient, GPS-free sys- 
tem for trajectory mapping using cellular tower finger- 
prints. The key lesson we learned was that sequencing 
cellular fingerprints before matching them is critical to 
achieving good accuracy. On smartphones, our CTrack 
implementation uses close to zero extra energy while 
achieving good mapping accuracy, making it a good way 
to distribute collaborative trajectory-based applications 
like traffic monitoring to a huge number of users without
any associated energy consumption or battery drain con- 
cerns. A GPS-free approach to trajectory matching also 
opens up the possibility of providing more fine-grained 
location services on the world’s most popular, cheapest 
phones that do not have GPS, but that do have GSM con- 
nectivity. 


ACKNOWLEDGMENTS 


This work was supported by the National Science Foun- 
dation under grant CNS-0931550. 


REFERENCES 


[1] Analog Devices AD9864 Datasheet: GSM RF Front End and
Digitizing Subsystem.
http://www.analog.com/static/imported-files/data_sheets/AD9864.pdf.

[2] Analog Devices, Inc. ADXL330: Small, Low Power, 3-Axis +/-3
g iMEMS Accelerometer (Data Sheet), 2007.
http://www.analog.com/static/imported-files/data_sheets/ADXL330.pdf.

[3] P. Bahl and V. Padmanabhan. RADAR: An In-building RF-based
User Location and Tracking System. In INFOCOM, 2000.

[4] N. Balasubramanian, A. Balasubramanian, and
A. Venkataramani. Energy Consumption in Mobile Phones: A
Measurement Study and Implications for Network Applications.
In IMC, 2009.

[5] F. Ben Abdesslem, A. Phillips, and T. Henderson. Less is more:
Energy-efficient Mobile Sensing with SenseLess. In MobiHeld,
2009.

[6] GPS and Mobile Handsets.
http://www.berginsight.com/ReportPDF/ProductSheet/bi-gps4-ps.pdf.

[7] A. Boustani, L. Girod, D. Offenhuber, R. Britter, M. I. Wolf,
D. Lee, S. Miles, A. Biderman, and C. Ratti. Investigation of the
Waste Removal Chain Through Pervasive Computing. IBM
Journal of Research and Development, 2010.

[8] M. Y. Chen, T. Sohn, D. Chmelev, D. Haehnel, J. Hightower,
J. Hughes, A. Lamarca, F. Potter, I. Smith, and A. Varshavsky.
Practical Metropolitan-scale Positioning for GSM Phones. In
UbiComp, 2006.

[9] I. Constandache, S. Gaonkar, M. Sayler, R. Choudhury, and
L. Cox. EnLoc: Energy-Efficient Localization for Mobile
Phones. In INFOCOM, 2009.

[10] I. Constandache, R. Roy Choudhury, and I. Rhee. CompAcc:
Using Mobile Phone Compasses and Accelerometers for
Localization. In INFOCOM, 2010.

[11] Fedex intros Senseaware Sensor for Tracking Packages.
http://www.electronista.com/articles/09/11/27/senseaware.sensor.sends.temps.drops.more.

[12] Y. Freund and R. E. Schapire. A Decision Theoretic
Generalization of Online Learning and an Application to
Boosting. In EuroCOLT, 1995.

[13] S. Gaonkar, J. Li, R. R. Choudhury, L. Cox, and A. Schmidt.
Micro-Blog: Sharing and Querying Content through Mobile
Phones and Social Participation. In MobiSys, 2008.

[14] Information On Human Exposure To Radiofrequency Fields
From Cellular and PCS Radio Transmitters.
http://www.fcc.gov/oet/rfsafety/cellpcs.html.

[15] iCartel. http://icartel.net/icartel-docs/.

[16] M. B. Kjærgaard, J. Langdal, T. Godsk, and T. Toftkjær.
EnTracked: Energy-efficient Robust Position Tracking for
Mobile Devices. In MobiSys, 2009.

[17] J. Krumm, J. Letchner, and E. Horvitz. Map Matching with
Travel Time Constraints. In SAE World Congress, 2007.

[18] K. Lin, A. Kansal, D. Lymberopoulos, and F. Zhao.
Energy-accuracy Trade-off for Continuous Mobile Device
Location. In MobiSys, 2010.


NSDI ’11: 8th USENIX Symposium on Networked Systems Design and Implementation 


[19] LoJack Car Security System For Stolen Vehicle Recovery.
http://www.lojack.com.

[20] Navizon. http://www.navizon.com.

[21] Qualcomm Transportation: OmniTRACKS Mobile
Communications System.
http://www.qualcomm.com/products_services/mobile_content_services/enterprise/omnitracs.html.

[22] OpenStreetMap. http://www.openstreetmap.org.

[23] J. Paek, J. Kim, and R. Govindan. Energy-efficient Rate-adaptive
GPS-based Positioning for Smartphones. In MobiSys, 2010.

[24] PNI Corporation. MicroMag3 3-Axis Magnetic Sensor Module.
http://www.sparkfun.com/datasheets/Sensors/MicroMag3%20Data%20Sheet.pdf.

[25] Qualcomm inGeo Service.
http://www.qualcomm.com/innovation/stories/ingeo.html.

[26] L. Ravindranath, C. Newport, H. Balakrishnan, and S. Madden.
Improving Wireless Network Performance Using Sensor Hints.
In NSDI, 2011.

[27] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, and
M. Srivastava. Using Mobile Phones to Determine
Transportation Modes. Transactions on Sensor Networks, 6(2),
2010.

[28] RunKeeper. http://runkeeper.com.

[29] Skyhook. http://www.skyhookwireless.com.

[30] Telit GE865 Datasheet.
http://www.telit.com/module/infopool/download.php?id=1666.

[31] A. Thiagarajan, J. Biagioni, T. Gerlich, and J. Eriksson.
Cooperative Transit Tracking Using GPS-enabled Smart-phones.
In SenSys, 2010.

[32] A. Thiagarajan, L. Sivalingam, K. LaCurts, S. Toledo,
J. Eriksson, S. Madden, and H. Balakrishnan. VTrack: Accurate,
Energy-Aware Road Traffic Delay Estimation Using Mobile
Phones. In SenSys, 2009.

[33] TomTom. http://www.tomtom.com.

[34] Trash Track. http://senseable.mit.edu/trashtrack.

[35] A. J. Viterbi. Error Bounds for Convolutional Codes and an
Asymptotically Optimum Decoding Algorithm. In IEEE
Transactions on Information Theory, 1967.

[36] Y. Wang, J. Lin, M. Annavaram, Q. A. Jacobson, J. Hong,
B. Krishnamachari, and N. Sadeh. A Framework of Energy
Efficient Mobile Sensing for Automatic User State Recognition.
In MobiSys, 2009.


USENIX Association 


Improving Wireless Network Performance Using Sensor Hints 


Lenin Ravindranath, Calvin Newport, Hari Balakrishnan and Samuel Madden 
MIT Computer Science and Artificial Intelligence Laboratory 
{lenin, cnewport, hari, madden}@csail.mit.edu 


Abstract 


With the proliferation of mobile wireless devices such as 
smartphones and tablets that are used in a wide range of 
locations and movement conditions, it has become im- 
portant for wireless protocols to adapt to different set- 
tings over short periods of time. Network protocols that 
perform well in static settings where channel conditions 
are relatively stable tend to perform poorly in mobile 
settings where channel conditions change rapidly, and 
vice versa. To adapt to the conditions under which com- 
munication is occurring, we propose the use of exter- 
nal sensor hints to augment network protocols. Com- 
modity smartphones and tablet devices come equipped 
with a variety of sensors, including GPS, accelerometers, 
magnetic compasses, and gyroscopes, which can provide 
hints about the device’s mobility state and its operating 
environment. We present a wireless protocol architecture 
that integrates sensor hints in adaptation algorithms. We 
validate the idea and architecture by implementing and 
evaluating sensor-augmented wireless protocols for bit 
rate adaptation, access point association, neighbor main- 
tenance in mobile mesh networks, and path selection in 
vehicular networks. 


1 INTRODUCTION 


With over 172 million devices sold in 2009, smartphones 
are a rapidly growing market [27]. Some analysts predict 
that smartphones and pads/tablets will surpass world- 
wide PC sales by the end of 2011 [20]. These devices 
may well become the dominant mode of Internet access 
in the near future [19]. 

With the proliferation of these “truly mobile” devices, 
it is increasingly common for wireless network proto- 
cols to have to deal with both static and mobile us- 
age within a short time period. Consider, for example, 
a smartphone user at the supermarket who alternates 
between standing still in front of product displays and 
moving between aisles, all the while streaming audio 
through the in-store wireless network. Mobility intro- 
duces difficult problems that wireless network protocols 
must overcome to achieve good performance. During 
motion, the vagaries of wireless communication become 
more pronounced: channel quality varies rapidly, losses 
become more bursty, and assessments of channel behavior 
are quickly outdated. Because of this, nodes should 
not maintain long histories, as the rapidly changing chan- 
nel conditions and network topology would quickly ren- 
der them invalid. Routing tables may also need to adapt 
quickly to neighbor changes, and the optimal next-hop 
may depend on the direction and speed of movement. 

However, strategies that compensate for these 
mobility-related difficulties are unlikely to be optimal 
in stationary scenarios [4, 25]. When nodes are static, 
they can average estimates of channel quality, observe 
their neighbors, and compute routes over long time 
scales (many seconds), carefully obtaining and updating 
observations from many packets. In so doing, they can 
correctly avoid reacting to the inevitable short-term vari- 
ations that even static wireless networks encounter (e.g., 
due to short-term fading). Previous work has generally 
not distinguished between these modes, attempting 
instead to adapt seamlessly across extremely different 
network conditions. 

The key insight in our work is that nodes can use exter- 
nal (to the network stack) sensor hints to improve the per- 
formance of wireless network protocols. Our approach is 
practical and readily implementable because almost ev- 
ery smartphone and tablet today comes equipped with 
a wide array of sensors like GPS, accelerometers, com- 
passes, and so on. These sensors are used by applications, 
but are largely ignored by the network stack and proto- 
cols. We show how data from these sensors can provide 
hints to protocols about the mobility mode of the device. 
By “mobility mode,” we mean attributes such as whether 
the device has started moving or is static, its speed of motion, 
its position, and the heading (direction) of motion: 
all factors that affect wireless network protocol perfor- 
mance. Protocols can explicitly adapt their behavior and 
parameters to the current mobility mode. 

Sensor hints may be used in different ways in dif- 
ferent protocols. When a node generates a hint locally 
or receives a hint from a neighbor, it may adapt in re- 
sponse to it. The adaptation might be continuous in 
nature (e.g., updating protocol parameters) or discrete 
(e.g., switching from a static-optimized to a mobility- 
optimized protocol). In Section 2, we introduce a novel 
sensor-augmented wireless architecture that allows devices 
to extract hints and provide them to protocols. To 
the best of our knowledge, ours is the first general ap- 
proach to using sensor hints to augment a variety of net- 
work protocols. 


In addition to the sensor-augmented network architec- 
ture, we make four contributions: 

1. Hint-aware bit rate adaptation: In Section 3, we 
describe and evaluate our implementation of a novel 
frame-based bit rate adaptation protocol, RapidSample, 
and show through trace-based simulation and testbed 
experiments that it obtains up to 70% better through- 
put than existing frame-based and SNR-based rate adap- 
tation protocols, and comparable throughput to Soft- 
Rate [25], when a node is in motion. We use Rapid- 
Sample to develop a hint-aware bit rate adaptation pro- 
tocol that switches strategies based on mobility hints 
and show through exhaustive trace-based evaluation and 
testbed experiments that it obtains between 17% and 
52% better throughput than SampleRate, 17% and 39% 
better throughput than RRAA, and 11% and 47% better 
throughput than SNR-based schemes, in mixed mobility 
scenarios in various environments. 

2. WiFi access point (AP) association: In Section 4, 
we describe a hint-aware AP association protocol with 
two modes: maximizing bulk transfer throughput and 
minimizing handoffs. We show through trace-based eval- 
uation that the hint-aware protocol improves throughput 
by 30% and reduces the number of handoffs by 40% 
compared to today’s standard scheme. 

3. Mobile topology maintenance: In Section 5, we 
show experimentally that maintaining acceptable error 
rates for topology maintenance while mobile requires 
over 20 times more traffic than in the stationary case. We 
implement a hint-aware protocol that switches to this ex- 
pensive probing only when in motion. 

4. Path selection in vehicular mesh networks: In 
Section 6, we present a collection of hint-aware path se- 
lection metrics for vehicular networks and show, using 
trace-based simulation, that they increase the stability of 
short routes by nearly a factor of 5 compared to the hint- 
free approach. 


2 DESIGN 


Current wireless protocols adapt their behavior based on 
in-network information such as loss rate, bit errors, or 
SNR. In contrast, we present a hint-aware protocol archi- 
tecture that augments this in-network information with 
hints from external sensors, which can be used at all lay- 
ers of the network stack to improve performance. In addi- 
tion to using local sensor hints, a protocol can also adapt 
based on sensor hints communicated from other nodes. 
In this section, we first present a general-purpose hint-aware 
protocol architecture. We then describe simple and 
accurate techniques for extracting mobility hints from 
sensors such as GPS, accelerometers and compasses. 


Figure 1: Hint-aware protocol architecture. 

Figure 2: Hint types exposed by the Sensor Library. 


2.1 Hint-Aware Protocol Architecture 


Figure 1 depicts the architecture; the goal is to make it 
easy to augment wireless network protocols with sensor 
hints. The architecture provides a Sensor Hint Service 
that abstracts and hides the details of (1) querying var- 
ious sensors, (2) extracting hints from raw sensor data, 
and (3) communicating relevant hints over the network. 
The service exposes well-defined interfaces to achieve 
these goals. Our current implementation of the Sensor 
Hint Service runs as a background service on the An- 
droid platform and as a Click module for Linux mobile 
devices. It should be straightforward to incorporate this 
service into other mobile platforms. 

The Sensor Hint Service has three components: 

1. Sensor Library. The Sensor Library processes raw 
sensor data to extract useful hints. We focus on mobility 
hints and our implementation currently supports the hint 
types shown in Figure 2. Section 2.2 discusses how these 
hints are extracted. 

2. Hint Transport Layer. Some protocols can bene- 
fit from hints from other nodes. For instance, a bit rate 
adaptation protocol can adapt its bit rate using not only 
its own movement hints, but also movement hints from 
nodes the protocol is communicating with. The Hint 
Transport Layer provides a protocol-independent way to 
communicate hints. 

When sending a hint to another node, the Sensor Hint 
Manager (described below) constructs a hint message 
(shown in Figure 3) and delivers it to the Hint Transport 
Layer, which in turn sends the hint. The hint message 
consists of the source MAC address and (hint type, hint 
value) pairs. When receiving a hint from another node, 
the Hint Transport Layer delivers the received hint mes- 
sage to the Sensor Hint Manager, which in turn delivers 
it to the appropriate protocol. 

The Hint Transport Layer provides two communica- 
tion mechanisms to send and receive hints. The first 
uses UDP. Each node opens a pre-defined UDP port, the 
HINTS port, to receive hint messages. Hint messages 
may either be unicast or broadcast to this UDP port. 

The UDP scheme works only as long as the nodes 
are connected through IP. In certain hint-aware wireless 
protocols (Section 5 and Section 6) nodes do not have 
IP connectivity, instead communicating via a link-layer 
protocol such as 802.11’s link layer. Thus, for our sec- 
ond scheme, we use a reserved protocol type in the link- 
layer MAC header to denote a hint message frame (Fig- 
ure 3). The Hint Transport Layer then listens for unicast 
or broadcast hints sent in link-layer frames. An alterna- 
tive scheme might be to overload or piggy-back hints on 
existing 802.11 frames; we leave the exploration of this 
possibility to future work. 

Because Android phones do not (yet) support sending 
raw 802.11 frames from user-level, we implemented only 
the UDP mechanism for phones. For Linux devices, we 
implemented both schemes. Legacy nodes not running 
the Sensor Hint Service will simply ignore the hint mes- 
sages, as long as the HINTS port is not in use by some 
other application. 

3. Sensor Hint Manager. The Sensor Hint Manager 
arbitrates communication between the protocol, the Sen- 
sor Library and the Hint Transport Layer. It exposes a lo- 
cal socket interface (different from the HINTS port) for 
protocols to interact with the Sensor Hint Service. Pro- 
tocols register for one or more hints using REGISTER 
(HintTypes[], ReportRate, CallbackPort, Source). 
Once registered, the Sensor Hint Manager uses the Call- 
backPort to stream hints to the protocol. The Source 
field can be LOCAL, REMOTE, or ALL, corresponding 
to local hints, remote hints, or both. The protocol can 
specify a ReportRate, in milliseconds, which indicates 
how often to report the hint. ReportRate also takes two 
special values: 0 means “as fast as possible” and -1 
means “only when there is a change in the hint state”. 

Protocols use SEND(HintTypes[], SendRate, Com- 
Type, Address) to instruct the service to send hints to 
other nodes. SendRate takes values similar to ReportRate 
in the REGISTER command, with the same conventions. 
ComType specifies the communication type 
(currently either UDP or link-layer MAC frames). Hints may be 
unicast to a specific node or broadcast in either ComType 
setting. 

REGISTER and SEND both return a unique ID to the 
protocol. The protocol can use the returned ID to stop 
sending hints using the STOP (ID) command. 
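
The REGISTER, SEND, and STOP commands described above can be sketched as a small in-memory registry. This is only an illustration of the command semantics, not the authors' implementation: the class and method names, the string constants for Source and ComType, and the choice to let STOP also cancel a registration are all assumptions made for the example.

```python
import itertools

# Special rate values described in the text.
AS_FAST_AS_POSSIBLE = 0
ON_CHANGE = -1

# Hypothetical string encodings of the Source and ComType arguments.
SOURCES = ("LOCAL", "REMOTE", "ALL")
COM_TYPES = ("UDP", "MAC")


class SensorHintManager:
    """In-memory stand-in for the command interface (hypothetical names)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.registrations = {}  # ID -> REGISTER arguments
        self.senders = {}        # ID -> SEND arguments

    def register(self, hint_types, report_rate, callback_port, source):
        """REGISTER: subscribe to hints; returns a unique ID."""
        if source not in SOURCES:
            raise ValueError("source must be LOCAL, REMOTE, or ALL")
        request_id = next(self._next_id)
        self.registrations[request_id] = (
            tuple(hint_types), report_rate, callback_port, source)
        return request_id

    def send(self, hint_types, send_rate, com_type, address):
        """SEND: ask the service to push hints to other nodes."""
        if com_type not in COM_TYPES:
            raise ValueError("com_type must be UDP or MAC")
        request_id = next(self._next_id)
        self.senders[request_id] = (
            tuple(hint_types), send_rate, com_type, address)
        return request_id

    def stop(self, request_id):
        """STOP: cancel the request with this ID; True if it was active.

        Cancelling a REGISTER as well as a SEND is an assumption here.
        """
        removed_send = self.senders.pop(request_id, None)
        removed_reg = self.registrations.pop(request_id, None)
        return removed_send is not None or removed_reg is not None
```

A protocol would typically call `register(["movement"], ON_CHANGE, port, "ALL")` once at start-up and keep the returned ID for a later `stop`.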


2.2 Extracting Hints 


In this section, we describe how to extract the hints 
shown in Figure 2—movement, walking, heading, speed, 
and environment—using standard sensors found on most 
smartphones and tablets. 

Movement hint. Movement is a boolean hint that is 
true if, and only if, a device is moving, i.e., if either the 
device’s acceleration or its speed is non-zero. We obtain 
this information from the acceleration sensor indoors, 
and from the combination of GPS and the acceleration 
sensors outdoors. Note that it is important to quickly cap- 
ture the situation when a device has started moving after 
being at rest, and vice versa, so measuring the accelera- 
tion is important. 

The accelerometer on most smartphones reports force 
values for its x, y, and z axes, at a certain sample rate 
(usually 20-500 Hz). The values are reported either in 
m/s² or in terms of g (= 9.8 m/s²). Figure 4 plots a raw 
accelerometer trace of a smartphone user who walks in 
the 6-14 second and 22-32 second periods, and is static 
the rest of the time. The accelerometer shows a signifi- 
cantly higher variance while moving than when station- 
ary. We use this variance to extract a movement hint. 

For every new accelerometer sample, we compute the 
standard deviation of the magnitude of the acceleration 
over a sliding window (w) of samples. The window slides 
by one sample for each computation. If the standard de- 
viation in a window exceeds a threshold (a), we detect 
movement. When the standard deviation is within the 
threshold for n successive sliding windows, we report 
that the node is stationary. 

We experimented with many values for w, a, and n and 
determined that w = 5, a = 0.15 m/s², and n = 10 gave 
us few false hints. Figure 5 illustrates our movement hint 
extraction for the trace in Figure 4. We have implemented 
the above technique on four different platforms (Android 
Nexus One, Android Google G1, iPhone 4, and a SparkFun 
accelerometer connected to a Linux laptop) and found 
that these parameters offer good performance in all cases. 
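
The sliding-window detector described above can be sketched as follows. This is a minimal illustration, assuming samples arrive one at a time in m/s² and that the population standard deviation is used; the class and attribute names are hypothetical, while the default values of w, a, and n are the ones reported in the text.

```python
import math
from collections import deque


class MovementDetector:
    """Sliding-window movement hint from accelerometer magnitude (sketch)."""

    def __init__(self, window=5, threshold=0.15, quiet_windows=10):
        self.window = window                # w: samples per sliding window
        self.threshold = threshold          # a: std-dev threshold in m/s^2
        self.quiet_windows = quiet_windows  # n: quiet windows before "static"
        self.magnitudes = deque(maxlen=window)
        self.quiet_count = 0
        self.moving = False

    def update(self, x, y, z):
        """Feed one (x, y, z) sample; return the current movement hint."""
        self.magnitudes.append(math.sqrt(x * x + y * y + z * z))
        if len(self.magnitudes) < self.window:
            return self.moving  # not enough samples for a full window yet
        mean = sum(self.magnitudes) / self.window
        variance = sum((m - mean) ** 2 for m in self.magnitudes) / self.window
        if math.sqrt(variance) > self.threshold:
            # One noisy window is enough to report movement.
            self.moving = True
            self.quiet_count = 0
        elif self.moving:
            # Require n successive quiet windows before reporting stationary.
            self.quiet_count += 1
            if self.quiet_count >= self.quiet_windows:
                self.moving = False
                self.quiet_count = 0
        return self.moving
```

The asymmetry is deliberate: a single window above the threshold flips the hint to moving, whereas n quiet windows are needed to flip it back, which matches the faster movement detection and slower stationarity detection reported for each platform.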

On the Android platform with a maximum accelerometer 
sample rate of 50 Hz, we were able to detect movement 
within 100 ms and detect that the node became stationary 
within 200 ms. On the SparkFun platform, with a 
sample rate of 500 Hz, we were able to detect movement 
within 10 ms and stationarity within 20 ms. 

The movement hint is used by the protocols described 
in Section 3 and Section 5 to improve bit rate adaptation 
and topology maintenance, respectively. 

Figure 3: Hint message and packet formats. 

Walking hint. Whereas a simple movement hint is 
useful in some cases, in other situations it is valuable to 
detect whether a user is walking versus other types of 
movement, such as when the user is stationary but mov- 
ing the device. We accomplish this using the walking de- 
tector developed in TransitGenie [22] and apply it to AP 
selection (Section 4). 

Heading hint. Heading can be determined from dig- 
ital compasses (magnetometers) that are available on 
many devices. GPS also allows us to infer a heading 
when a device is moving outdoors. These sensors pro- 
duce a heading in degrees relative to the earth’s magnetic 
north pole. To use a compass to determine the heading of 
the user holding a device, and not the heading of the de- 
vice itself, it is necessary to first determine the device’s 
orientation. The standard technique used by inertial navi- 
gation systems is to use gyroscope sensors in conjunction 
with the accelerometer to infer this orientation [21]. In 
our indoor experiments, we assume we know the orien- 
tation of the device, and use only the compass readings. 
These heading hints are used by the protocols described 
in Section 4 and Section 6 to improve access point selec- 
tion and vehicular path selection, respectively. 

Speed hint. To determine a speed hint outdoors we 
can use the speed values reported by GPS. We use this 
hint in Section 6 for path selection. 

Environment (indoor/outdoor) hint. To determine 
whether a user is indoors or outdoors we use the fact 
that it is typically impossible to get a GPS fix indoors. 
In Section 4 we use this hint to improve AP association. 


3 HINT-AWARE BIT RATE ADAPTATION 


Sensor hints aid in bit rate adaptation because node mo- 
bility affects wireless channel conditions, causing large 
and bursty changes over short intervals of time. When a 
node moves, bit errors and packet losses exhibit a higher 
degree of statistical correlation with past behavior as 
compared to the static case. We demonstrate this effect 
in Figures 6 (left) and 6 (middle). 

Figure 4: Raw accelerometer trace. 

Figure 5: Movement hint extraction from accelerometer data. 

Figure 6 (left) plots the conditional probability of losing 
packet number i + k at a given bit rate, given that 
packet number i was lost, for different values of k (the 
“lag”). In this indoor experiment, we sent back-to-back 
1000-byte packets at 54 Mbits/s from a stationary lap- 
top to a stationary smartphone in the static case, and to 
a smartphone carried by a walking user in the mobile 
case. A link-layer ACK received from the smartphone 
indicated a packet success, otherwise the packet was 
considered lost. The graph shows a significantly higher 
loss probability for small values of k in the mobile case, 
demonstrating a larger degree of short-range dependence 
compared to the static case. In this scenario, for the mobile 
case, the packet immediately following a lost packet is significantly 
more likely to be lost than in the static case, and 
also compared to larger values of k. For both the static 
case and the mobile case, the unconditional loss proba- 
bility was around 23%. 
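
As an illustration of the quantity plotted in Figure 6 (left), the conditional loss probability at lag k can be estimated from a per-packet loss trace. The sketch below assumes the trace is a list of booleans in send order (True meaning lost); the function name is hypothetical and the authors' analysis scripts are not shown in the paper.

```python
def conditional_loss_probability(lost, k):
    """Estimate P(packet i+k lost | packet i lost) from a loss trace.

    `lost` is a sequence of booleans in send order (True = lost).
    Returns None when the trace contains no usable lost packet.
    """
    if k < 1:
        raise ValueError("lag k must be at least 1")
    # Outcomes of packet i+k for every i whose packet was lost.
    followers = [lost[i + k] for i in range(len(lost) - k) if lost[i]]
    if not followers:
        return None
    return sum(followers) / len(followers)
```

Comparing this value against the unconditional loss rate, `sum(lost) / len(lost)`, for small k shows how much short-range dependence a trace contains.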


For the same traces, Figure 6 (middle) shows the mu- 
tual information between packet success/failure events 
separated by x ms. Specifically, we compute the mutual 
information between every pair of two success/failure 
events separated by a time interval of x ms for a range 
of different values of x. This measure shows the extent 
to which the fate of a later packet depends on the earlier 
one. In the static case, there is no mutual information be- 
tween packets. But when a node moves, packets exhibit 
a higher degree of dependence with the past few pack- 
ets. This dependence drops off at around 10 ms in these 
experiments. In Figure 6 (right), we plot the mutual information 
curve for different walking speeds and find 
that the dependence drops off at around 10-20 ms. 
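
The mutual information measure used in Figure 6 (middle and right) can be sketched as below. This is a minimal estimate from empirical frequencies, assuming success/failure events are evenly spaced in time so that a separation of x ms corresponds to a fixed lag in packets; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(events, lag):
    """Mutual information, in bits, between events[i] and events[i + lag].

    `events` is a sequence of hashable outcomes (e.g. 0 = lost, 1 = received).
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    pairs = list(zip(events, events[lag:]))
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (first[a] * second[b]))
    return mi
```

A value near zero means the earlier packet's fate says little about the later one, as in the static traces; larger values indicate the dependence seen under movement.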


These results show that the best strategy for bit rate 
adaptation is likely to be different when nodes move than 
when they are static. In more detail, in the static case, 
where the channel remains relatively stable, it makes 
sense to maintain a longer history of performance at dif- 
ferent bit rates to smooth over periods of short-term fad- 
ing or contention. Such a long-history approach falters 
when the device is mobile, because in the mobile case 
it makes more sense to keep only a short history, re- 
act quickly to errors, and perhaps sample other rates ag- 
gressively to track the faster changes typical of a mobile 
channel. 


This observation motivates a hint-aware bit rate adaptation 
scheme, which adapts differently depending on 
whether or not the nodes are moving. By using external 
sensor hints rather than relying solely 
on network information, we aim to combine schemes 
tuned separately for the static and mobile cases. The approach 
requires no training to achieve good performance. 


Figure 6: Left: Given a packet loss, the conditional probability of losing the k-th packet following the loss, as a function 
of the lag, k. The unconditional loss probability for both the static and mobile cases was around 23%. Middle: Mutual 
information between packets separated by x ms specified by the x-axis value. In the static case, there is essentially 
no mutual information between packets, while in the mobile case, packets separated by less than 10 ms show a high 
degree of dependence. Right: Mutual information between packets separated by x ms for various walking speeds. 


RapidSample(lastbr, gotack): 
  if (!gotack) then 
    failedTime[lastbr] ← CurrTime() 
    if (sample) then 
      br ← oldbr 
    else 
      br ← max{0, lastbr − 1} 
    sample ← 0 
  else 
    sample ← 0 
    if (CurrTime() − pickedTime[lastbr] > δ_success) then 
      br ← max{i | ∀j ≤ i : CurrTime() − failedTime[j] > δ_fail} 
      sample ← 1 
      oldbr ← lastbr 
    else br ← lastbr 
  if br ≠ lastbr 
    pickedTime[br] ← CurrTime() 
  return br 


Figure 7: The RapidSample bit rate adaptation algorithm. 
It is called for each packet with lastbr describing the bit 
rate index and gotack describing whether an ack was 
received for the previous packet. Time is reported in 
elapsed milliseconds. 


With these remarks in mind, we introduce RapidSam- 
ple, a frame-based rate adaptation protocol designed for 
a channel undergoing rapid changes due to movement. 


3.1 The RapidSample Protocol 


The RapidSample protocol is shown in Figure 7. It starts 
with the fastest bit rate. If a packet fails to get a link layer 
ACK, the protocol switches to the next lowest rate and 
records the time of the failure. After success at a particular 
bit rate for more than δ_success milliseconds (5 in our 
implementation), the sender attempts to sample a higher 
bit rate. It chooses the fastest bit rate (a) that has not 
failed in the last δ_fail milliseconds (10 in our implementation), 
and (b) for which there is no slower bit rate that 
has failed within this interval. If the faster rate fails, it re- 
verts to the original rate; if it succeeds, it adopts this new 
faster rate. 
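
The behaviour described in this paragraph can be sketched in Python as follows. This is an illustrative reading of the description and of Figure 7, not the authors' code: the class and method names are hypothetical, the current time is passed in explicitly (in milliseconds) instead of being read from a clock, and falling back to the slowest rate when even that rate has failed recently is an assumption.

```python
import math


class RapidSample:
    """Sketch of the RapidSample rate selection logic (hypothetical API)."""

    def __init__(self, num_rates, delta_success=5.0, delta_fail=10.0,
                 start_time=0.0):
        self.num_rates = num_rates
        self.delta_success = delta_success  # ms of success before sampling
        self.delta_fail = delta_fail        # ms to avoid a failed rate
        self.failed_time = [-math.inf] * num_rates
        self.picked_time = [-math.inf] * num_rates
        self.initial_rate = num_rates - 1   # start with the fastest rate
        self.picked_time[self.initial_rate] = start_time
        self.sample = False                 # True while trying a sampled rate
        self.old_br = self.initial_rate     # rate to revert to if a sample fails

    def _fastest_eligible(self, now):
        """Fastest rate such that it and all slower rates have not failed recently."""
        best = 0  # assumption: fall back to the slowest rate
        for i in range(self.num_rates):
            if now - self.failed_time[i] <= self.delta_fail:
                break
            best = i
        return best

    def next_rate(self, last_br, got_ack, now):
        """Return the rate index to use after sending one packet at last_br."""
        if not got_ack:
            self.failed_time[last_br] = now
            if self.sample:
                br = self.old_br          # failed sample: revert
            else:
                br = max(0, last_br - 1)  # step down one rate
            self.sample = False
        else:
            self.sample = False
            if now - self.picked_time[last_br] > self.delta_success:
                br = self._fastest_eligible(now)
                self.sample = True
                self.old_br = last_br
            else:
                br = last_br
        if br != last_br:
            self.picked_time[br] = now
        return br
```

A caller would invoke `next_rate` once per transmitted packet, feeding back the rate index it used and whether a link-layer ACK arrived.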


There are four ideas motivating RapidSample. First, 
we observed that when a packet fails while a node is 
moving, the probability of the next few packets failing at 
this bit rate is high (Figure 6, left). Therefore, the pro- 
tocol immediately reduces the bit rate. Second, as we 
showed in our discussion of Figure 6 (middle), the mutual 
information between the fates of packets x milliseconds 
apart becomes small when x is around 10-15 
ms for all the indoor movement speeds we tested. We 
use a value of 10 ms for δ_fail as the minimum time to 
wait before sampling a previously failed rate, and before 
sampling any rate higher than the failed rate. 


Third, RapidSample attempts higher rates after only 
a small number of successes at the current rate. We set 
δ_success to be less than δ_fail. In general, it is difficult to 
tell if the channel conditions are improving or degrad- 
ing, but under movement, we posit that if conditions are 
not degrading, they are probably improving because it is 
unlikely that they are invariant. Thus, even a few suc- 
cesses at one rate provide enough confidence to sample 
higher rates that have not recently failed. Fourth, if we 
are wrong about the channel improving, and a higher rate 
fails, we immediately revert to the original rate. 


3.2 Hint-Aware Bit Rate Adaptation Protocol 


The Hint-Aware Rate Adaptation Protocol implemented 
at the sender uses RapidSample when a node is moving 
and uses SampleRate [3] when a node is static. It 
relies on movement hints from the receiver to switch be- 
tween the two. We use SampleRate for the static case as it 
performed better than other frame-based and SNR-based 
protocols in various environments (see Section 3.3). 
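
The switching logic itself is simple and can be sketched as a thin dispatcher. The example below is only an illustration: the class name and callback signature are hypothetical, the two strategies are passed in as plain callables standing in for RapidSample and SampleRate, and any state hand-over between the two strategies at a switch is left out.

```python
class HintAwareRateAdapter:
    """Chooses between a mobile and a static rate strategy (sketch)."""

    def __init__(self, mobile_strategy, static_strategy):
        # Each strategy is a callable (last_br, got_ack, now) -> rate index.
        self.mobile_strategy = mobile_strategy
        self.static_strategy = static_strategy
        self.receiver_moving = False

    def on_movement_hint(self, moving):
        """Record the latest movement hint reported by the receiver."""
        self.receiver_moving = bool(moving)

    def next_rate(self, last_br, got_ack, now):
        """Delegate to the strategy that matches the current hint."""
        if self.receiver_moving:
            return self.mobile_strategy(last_br, got_ack, now)
        return self.static_strategy(last_br, got_ack, now)
```

Keeping the hint handling separate from the strategies means either strategy can be swapped without touching the code that receives hints.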


3.3 Evaluation 


We use both trace-driven simulation and testbed experi- 
ments to evaluate our hint-aware rate adaptation scheme. 


3.3.1 Trace-driven Simulation 

To replicate the same mobility pattern between different 
experiments, we used trace-driven simulation—feeding 
real-world experimental data to a wireless simulator, al- 
lowing for both reproducibility and realism. We used the 
same experimental architecture as [25], which modified 
the ns-3 network simulator (v3.2) to read in experimental 
traces describing, for each 5 ms time slot, the fate of each 
packet sent at each bit rate during that time slot. This 
setup bypasses the physical layer’s propagation model, 
instead referencing the trace file to determine if a packet 
should be received successfully. 
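
The idea of replacing the propagation model with a trace lookup can be sketched independently of ns-3. The example below is a simplification under stated assumptions: the trace is held as a list of per-slot lists of booleans (one per bit rate), one packet is sent per 5 ms slot, and all names are hypothetical rather than taken from the simulator.

```python
SLOT_MS = 5          # trace granularity described in the text
PACKET_BYTES = 1000  # packet size used when collecting the traces


class PacketFateTrace:
    """Per-slot, per-rate packet outcomes read from an experiment (sketch)."""

    def __init__(self, slots):
        # slots[t][r] is True if a packet at rate index r succeeded in slot t.
        self.slots = slots

    def delivered(self, time_ms, rate_index):
        """Look up the fate of a packet sent at time_ms with rate_index."""
        slot = int(time_ms // SLOT_MS)
        return self.slots[slot][rate_index]


def replay(trace, choose_rate, start_rate):
    """Replay a trace against a rate policy; return the bits delivered.

    choose_rate(last_rate, got_ack, now_ms) returns the next rate index.
    Sending exactly one packet per slot is an assumption of this sketch.
    """
    rate = start_rate
    delivered_bits = 0
    for slot_index in range(len(trace.slots)):
        now = slot_index * SLOT_MS
        ok = trace.delivered(now, rate)
        if ok:
            delivered_bits += PACKET_BYTES * 8
        rate = choose_rate(rate, ok, now)
    return delivered_bits
```

Because every policy is replayed against the same recorded outcomes, differences in delivered bits reflect the policies and not run-to-run channel variation.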

To collect the traces, we configured a Linux laptop as a 
sender. It ran the Click router using the MadWiFi 802.11 
driver, which in turn used an Atheros 802.11 chipset. The 
laptop sent a constant stream of 1000-byte packets, cycling 
through the 802.11a OFDM bit rates of 6, 9, 12, 
18, 24, 36, 48, and 54 Mbit/s in round-robin order. Each cycle 
through all 8 bit rates took approximately 5 ms. Indoors, 
we used 802.11a to minimize interference with 
local infrastructure networks. We configured a second 
laptop with the same hardware to act as a receiver, log- 
ging every received packet. This laptop was additionally 
equipped with a SparkFun serial accelerometer for move- 
ment hints. 

We collected several traces from four different envi- 
ronments for static and mobile scenarios: 1) an office 
setting with no line-of-sight between the sender and re- 
ceiver, 2) a long hallway with line-of-sight between the 
nodes, 3) an outdoor setting with a lightly crowded out- 
door pavement area, and 4) a vehicular setting where the 
sender is stationary on the roadside and the receiver is in 
a moving car near MIT (an urban area). 

We evaluated the following frame-based bit rate 
adaptation protocols: RapidSample, SampleRate [3], 
RRAA [26], and our hint-aware method that switches be- 
tween RapidSample and SampleRate, depending on the 
sensor hint. We also evaluated two SNR-based rate adap- 
tation protocols: RBAR [7] and CHARM [8]. For both 
these schemes, we trained the protocol for the operating 
environment. We also assumed that the sender has up- 
to-date knowledge about the receiver SNR. Finally, we 
compared our protocol to SoftRate [25], a bit rate adaptation 
scheme that uses SoftPHY hints from a modified 
physical layer and which can adapt the bit rate on a per- 
packet basis without requiring training. For this compar- 
ison we used the traces from [25]. 

Figure 8 shows the performance of the hint-aware pro- 
tocol compared to the other rate adaptation protocols for 
three of the four environments (we discuss the vehicu- 
lar setting later in this section). For each environment, 
we collected 10 to 20 traces. Each trace is 20 seconds long,
split evenly between static and mobile periods: the receiver
was static for 10 seconds and mobile for 10 seconds. The
workload was TCP. The graph shows
the average TCP throughput of all the schemes as a frac- 
tion of the throughput obtained by the hint-aware proto- 
col. The error bars show the 95% confidence interval. In 
every environment, the hint-aware protocol obtained sig- 
nificant performance gains. It improved over SampleRate 
by 23% to 52% on average, over RRAA by 17% to 39%, 
and over RBAR by 11% to 47%. We do not show the 
numbers for CHARM as the performance of RBAR and 
CHARM was similar in all cases with RBAR performing 
slightly better. 

We also evaluated the different protocols separately 
for mobile and static scenarios. For each scenario, we 
collected ten 20-second traces in each of the test envi- 
ronments. Figure 9 shows the average TCP bulk trans- 
fer throughput of all the schemes as a fraction of the 
throughput obtained by RapidSample, in the mobile case. 
RapidSample performed significantly better than other 
schemes in every environment. It obtained up to 75% 
better throughput on average than SampleRate and up to 
25% better than other protocols. It achieved about 28% 
more throughput than SampleRate, 36% more throughput
than RRAA, and nearly 2x more throughput than the
SNR-based protocols. These performance gains come
from RapidSample’s ability to cope with the rapid
fluctuations in the channel conditions when a node is mo- 
bile. 

On the other hand, RapidSample is the worst- 
performing protocol in the static case, as shown in Fig- 
ure 10. It achieved 12% to 28% lower average throughput 
compared to SampleRate and up to 18% lower through- 
put compared to RRAA. The poor performance is be- 
cause RapidSample aggressively reduces the rate even 
on a single loss and frequently tries to sample higher 
rates even when the channel conditions are not changing. 
Figure 10 also shows that SampleRate usually achieved 
higher throughput than other protocols when the nodes 
are static. Hence, we decided to use SampleRate for the 
static case in our hint-aware rate adaptation protocol. 

We also measured the performance of RapidSample in 
a vehicular setting, where the sender was stationary on 
the roadside and the receiver was placed in a moving car. 
We collected 10 traces, each 10 seconds long. Figure 11 
shows the results, where the traffic workload is UDP (at 
a rate of 36 Mbps), as TCP repeatedly times out when 
faced with high packet loss rate [6]. Similar to other mo- 
bile environments, RapidSample outperformed the other 
schemes. 

Figure 11: In vehicular environments, RapidSample
achieves significantly higher throughput compared to 
other schemes. 


Finally, in Figure 12, we compare RapidSample to 
SoftRate, SampleRate, and RRAA, using the walking 
traces and ns-3 protocol implementations from [25]. 
RapidSample performs nearly as well as SoftRate on 
these traces, without the aid of SoftPHY hints, further 
confirming the effectiveness of RapidSample in mobile 
settings. As a result, our hint-aware protocol performs 
nearly as well as SoftRate, but is readily deployable on 
many of today’s commodity devices. 

Figure 12: RapidSample performs almost as well as SoftRate
on traces collected while walking.

Figure 10: In the static case, RapidSample [...] to the other schemes.


3.3.2 Testbed Experiments 


Trace-driven evaluation allows us to compare the perfor- 
mance of various protocols, but there might be a con- 
cern that the method used does not correctly account for 
the channel variations observed in practice. To show that 
the scheme can work in real-time, we deployed a testbed 
of Android Nexus One smartphone receivers commu- 
nicating with a MadWiFi-based Linux laptop sender. 
We implemented the frame-based rate adaptation proto- 
cols (SampleRate, RRAA, RapidSample, and our hint- 
aware protocol) on the laptop as user-level Click mod- 
ules; we did not implement the SNR-based protocols 
as they required SNR feedback from the receiver. The 
hint-aware protocol used the Sensor Hint Service on the 
laptop to monitor for movement hints from the smart- 
phone. It switched between SampleRate and RapidSam- 
ple schemes based on movement hints. The implemen- 
tation on the smartphone instructed the Sensor Hint Ser- 
vice to send movement hints to the laptop using UDP. 
The movement hints were sent every second and on hint 
changes (“static to moving” or “moving to static”).
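
The switching step described above can be sketched as follows. This is a minimal illustration, not the Click implementation: the class name `HintAwareSelector` and its methods are hypothetical, and the two callables merely stand in for the SampleRate and RapidSample rate-selection routines, whose internals are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HintAwareSelector:
    """Delegates rate selection to one of two algorithms based on a binary hint.

    static_algo stands in for SampleRate and mobile_algo for RapidSample;
    both are placeholders supplied by the caller (assumption).
    """
    static_algo: Callable[[], float]
    mobile_algo: Callable[[], float]
    moving: bool = False  # assume static until a hint says otherwise

    def on_hint(self, moving: bool) -> None:
        # Called for each movement hint (sent every second and on changes).
        self.moving = moving

    def next_rate(self) -> float:
        # Pick the algorithm that matches the most recent hint.
        algo = self.mobile_algo if self.moving else self.static_algo
        return algo()
```

Keeping the two algorithms behind a common callable interface means the hint only selects which one runs; neither algorithm needs to know about the sensor.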

We configured the laptop to send 802.11 data packets
to a smartphone’s wireless MAC address. Upon receiving
a packet, the phone responded with a link-layer
ACK. We put the phone in tethering mode to disable the
802.11 power-saving mode that was on by default. We
measured the performance of bit rate adaptation based on
the received ACKs. 

We evaluated the protocols in two environments: an 
office setting and a long hallway setting, the same as in 
the trace-based evaluation. In each environment, we used 
10 distinct mixed-mobility patterns and measured the 
throughput of each scheme. In each mobility pattern, a 
user walked in a predefined trajectory at a constant speed 
and stayed static at predefined locations for predefined 
amounts of time. Each such trial took 45-90 seconds to 
complete and had an equal amount of static and walking 
periods. The phone was held by the user in the same way 
across experiments. Since it was hard to exactly replicate 
the same mobility pattern, we repeated each trial 3 times 
and report the average and the standard deviation. A trial
consists of running SampleRate, RRAA, RapidSample 
and the hint-aware protocol back-to-back for the same 
mobility pattern. Three back-to-back trials correspond to 
one experimental run. 

Because the smartphone only had an 802.11b/g card, we
did all these experiments in the relatively busy 802.11b/g 
channels. To minimize interference from the existing ac- 
cess points, we ran the experiments late at night. In every 
experiment, we sent a stream of 1000-byte packets back- 
to-back. The bit rate of each outgoing packet was con- 
trolled by the rate adaptation scheme. We measured the 
throughput based on the received link-layer ACKs. The 
maximum throughput we were able to obtain from the 
user-level Click implementation was around 27 Mbps. 

Figure 13 (left) shows the measured throughput of dif- 
ferent protocols in the two environments. For each en- 
vironment, we plot the average throughput and standard 
deviation (as error-bars) for 10 different runs. The charts 
show the results sorted by the throughput of the hint- 
aware scheme. 

In both environments, the hint-aware protocol consis- 
tently outperforms the other schemes. In the office set- 
ting, it improved over SampleRate by between 10% and 
76%, over RRAA by between 8% and 100%, and over 
RapidSample by between 11% and 41%. On average, it
obtained 20% more throughput than SampleRate, 22%
more throughput than RRAA, and 19% more throughput
than RapidSample. In the hallway setting, the hint-aware
protocol obtained 9% to 49% more throughput than
SampleRate, 8% to 80% more throughput than RRAA, and
5% to 85% more throughput than RapidSample. On average,
it improved over SampleRate, RRAA, and RapidSample
by 17%, 37%, and 22%, respectively.

Compared to trace-driven results, SampleRate per- 
formed better than RRAA in these testbed experiments, 
especially in situations where the throughput of all the 
schemes was low. RRAA performed better when the 
throughput was higher. Otherwise, the testbed results 
were fairly consistent with the trace-driven simulations. 

During each trial, for every packet sent, in addition to 
logging if an ACK was successfully received, we logged 
the movement hint as well. We processed these traces 
and used the movement hint to split them into static and 
mobile phases and measured the throughput separately 
for each scenario. Figure 13 (middle) shows the average 
throughput obtained during the mobile phases of the cor- 
responding experimental runs shown in Figure 13 (left). 
In mobile scenarios, RapidSample performs significantly 
better than SampleRate and RRAA in both the environ- 
ments. On average, it improved over SampleRate by 61% 
and 40% in the two environments and over RRAA by 
16% and 39%. The relative performance of SampleRate 
was worse in the office setting compared to the hallway 
setting. This result is consistent with what we found in
the trace-driven evaluation. Similarly, Figure 13 (right)
plots the mean throughput for the static phases. As found
in the trace-based simulation, SampleRate was the best
protocol in the static case and RapidSample performed much
worse than SampleRate and RRAA. 


3.4 Discussion 


In our scheme we use only a binary movement hint that 
indicates whether the device is stationary or mobile. An 
important conclusion from our results is that even such 
a simple hint can produce significant performance im- 
provements. An obvious future direction is to general- 
ize our scheme to use additional hints such as speed and 
heading. Using Figure 6 (right), it is possible to fine-tune 
parameters in RapidSample for different speeds. While 
it is easy to get a movement hint, measuring speed ac- 
curately indoors using the sensors available on a smart- 
phone is a challenging unresolved problem. 

The use of hints for bit rate adaptation may improve 
PHY-assisted techniques such as SoftRate [25], which 
perform significantly better than existing protocols in the 
mobile case using an instantaneous estimate of the bit 
error rate. Augmented with a movement hint, however, 
they may be able to adapt their behavior in the static case 
to average these estimates and avoid reacting to short- 
term fading. 

One benefit of using the accelerometer for a move- 
ment hint is that it detects even small movements of the 
device—e.g., a user moving a smartphone from his head 
to pocket—which can change the channel conditions. Of 
course, it is also possible that the channel conditions can 
change due to changes in the surrounding environment, 
even if the device is stationary. If such changes are short- 
lived, then SampleRate, the protocol we use during sta- 
tionary periods, performs well. 

Our trace-driven evaluation shows that the hint-aware 
protocol performs better than trained SNR-based adap- 
tation in all the tested environments. One question that 
might arise is whether a protocol could simply use in- 
formation about changes in the observed RSSI values to 
infer movement and use a protocol like RapidSample in 
that case, without relying on external sensor hints. We 
explored this approach and found several problems with 
it. First, RSSI values showed large variations even when 
a node was static. Second, the magnitude of these varia- 
tions depended strongly on the environment and the de- 
vice. It also varied significantly across time and across 
different RSSI ranges (low RSSI ranges showed more 
fluctuations than high RSSI ranges). Third, the reported 
RSSI was extremely sensitive to movement in the en- 
vironment and triggered many false hints. Hence, us- 
ing RSSI was more error-prone than using explicit hints. 
Furthermore, explicit hints avoid the need for training. It
is, of course, conceivable that one could combine RSSI
and sensor hints to further improve bit rate adaptation;
achieving this goal without environment-specific training
remains an open question.

Figure 13: Throughput of the different bit rate adaptation protocols. The left-most chart shows all the protocols with
one data point per run; the error-bars are the standard deviations. There are ten runs inside offices and ten in the
hallways, with each run lasting 45 to 90 seconds. The middle chart shows the results considering only times when the
device was moving, while the right-most chart shows the results from the same runs considering only times when the
device was static. In these experiments, the hint-aware protocol was better than the next-best protocol, SampleRate,
by 20% (office) and 17% (hallway), with a mean overall improvement of 19%. When mobile, RapidSample
outperformed SampleRate by 61% (office) and 40% (hallway), with a mean overall improvement of 50%.


4 HINT AWARE AP ASSOCIATION 


Most wireless clients associate with the AP with the 
strongest RSSI (SNR) value. When the RSSI falls below 
some fixed threshold, the client triggers a handoff, where 
it scans all the channels and associates with the AP with 
the strongest RSSI. We refer to this approach as the standard
scheme.¹ As others have observed [14, 18], the standard
scheme is sub-optimal in many settings, particularly
when the client is mobile. In this section, we develop a 
hint-aware association protocol that performs better than 
the standard scheme. 

In our scheme, a node detects whether it is indoors or 
outdoors using a GPS lock hint. If it is indoors, its associ- 
ation strategy uses the “walking” hint (Section 2.2) to de- 
tect motion. The protocol may be configured at run-time 
to either maximize throughput (Section 4.1), or minimize 
the number of handoffs (Section 4.2); the former is use- 
ful for bulk transfers, while the latter is useful for inter- 
active real-time applications such as telephony for which 
the hundreds of milliseconds spent during a handoff will 
disrupt communication, increasing both jitter and packet 
loss [9] (handoffs took approximately 600 ms on the 
Android smartphones used in our experiments). When a 
node is outdoors, it implements a similar strategy, using 
the position and speed as hints. We do not evaluate the 
outdoor case in this paper. 

We implemented our association protocol as an eas- 
ily deployable background Android application. Below, 
we describe the two modes of the protocol and evaluate 
their performance. Our experimental results with indoor
mobility show a median throughput increase of 30% and
a median reduction of 40% in the number of handoffs
compared to the standard scheme.

¹Some association schemes do include periodic scans, but they are
done only every few minutes, and never while transferring data.


4.1 Using Hints to Maximize Throughput 


We present a hint-aware AP association strategy for max- 
imizing throughput. The strategy builds on three obser- 
vations. First, when a client is moving, the probability 
that a new AP with a stronger signal enters its range is 
higher than when the client is static. Hence, when mo- 
bile, a client should scan periodically to discover better 
APs: the throughput gain of associating with a better AP 
is likely to be higher than the throughput lost to the scan. 
The periodicity depends on the speed of the client and 
the expected range of the typical AP in the deployment. 

Second, when a client is stationary, it is less likely 
that new, better, APs will be discovered. In this case, the 
penalty of a scan is not worth incurring. 

Third, when a client stops moving, it may remain 
static for some period of time. If so, it is worth perform- 
ing a scan on this transition because the AP with the 
strongest RSSI is likely (though not guaranteed) to re- 
main optimal until the client moves again. When static, a 
client should re-scan and re-associate only when it starts 
moving again, or when the current AP’s RSSI becomes 
weaker than some threshold. In our experiments with the 
standard scheme, when a client moves from one location 
to another nearby location, in many cases it remains as- 
sociated with the original AP (because the signal strength 
remains above the handoff threshold), reducing through- 
put. By rescanning immediately following a transition 
from mobile to static, we avoid this problem. 

Our protocol is simple. When the association daemon 
running on the client detects that the client is walking, it 
scans periodically, every T_s seconds, for the AP with the
highest RSSI. If the client goes from the moving to static 
state, it performs a scan immediately and associates with 
the strongest AP. When it is static, it performs no scans, 
unless the RSSI drops below a threshold or if the client 
starts moving. 
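
The scan rules above can be sketched as a small state machine. This is a minimal illustration rather than the Android daemon itself: the class name `ScanScheduler`, the RSSI threshold of -80 dBm, and the method signature are assumptions introduced here; the 8-second default is the pedestrian-speed value reported later in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanScheduler:
    scan_period_s: float = 8.0          # periodic scan interval while walking
    rssi_threshold_dbm: float = -80.0   # hypothetical handoff threshold
    was_moving: bool = False
    last_scan_s: Optional[float] = None

    def should_scan(self, now_s: float, moving: bool, rssi_dbm: float) -> bool:
        """Return True if the client should scan now, given the walking hint."""
        started = moving and not self.was_moving
        stopped = self.was_moving and not moving
        self.was_moving = moving
        if moving:
            # Scan when movement begins, then once per period while walking.
            elapsed = (self.last_scan_s is None
                       or now_s - self.last_scan_s >= self.scan_period_s)
            due = started or elapsed
        else:
            # Static: scan once on the moving-to-static transition,
            # otherwise only if the current AP has become too weak.
            due = stopped or rssi_dbm < self.rssi_threshold_dbm
        if due:
            self.last_scan_s = now_s
        return due
```

A deployed version would likely also rate-limit scans triggered by a persistently weak RSSI; that detail is omitted here.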

Figure 14: CDF of the ratio of throughput obtained by
the hint-aware scheme to the throughput obtained by the 
standard scheme, calculated for 30 traces. 


The ideal value of T_s is the time it is likely to take
for the current AP to no longer be the best one while 
the user is moving—a factor which depends on how APs 
are deployed and how fast the user is moving. To get a 
sense for what it might be in practice, we wrote a data 
collection application on the Android platform that scans 
every second, recording the signal strength of every AP it 
hears. It also records the walking and heading hints with 
each scan. We convert each RSSI value to a throughput 
value using a rate map as in [11]. 

We collected several such traces with different movement
patterns and pedestrian speeds indoors in two different
buildings on the MIT campus. We found that at
pedestrian speeds, a value of T_s = 8 seconds maximized
throughput. In other words, 8 seconds is approximately 
the time required to walk between two adjacent APs. 

Performance evaluation. To quantify the throughput 
gains of our hint-aware protocol, we collected 30 walk- 
ing traces in MIT hallways. These traces are different 
from the ones we analyzed to determine the value of T_s,
but the setting was the same. We had the client transition 
from moving to stationary states randomly, with roughly 
equal time spent in each state. 

We performed a trace-based evaluation of our hint- 
aware association protocol compared to the standard 
scheme, on these traces. Figure 14 shows the CDF of 
the ratio of throughput obtained by our scheme to the 
throughput obtained by the standard scheme. The median 
throughput improvement is about 30%. 


4.2 Using Hints to Minimize Handoffs 


We now present a hint-aware AP association strategy 
for minimizing the number of handoffs, which is use- 
ful for applications such as VoIP. Our protocol requires 
lightweight training that can be deployed as a back- 
ground application on standard phones. The protocol 
uses the observation that to minimize handoffs, the AP 
with the strongest RSSI is not necessarily the AP that will 
yield the longest-lasting connection. If a client is moving 
towards an AP, for example, it is likely to stay connected 
longer than if it is moving away, even if the RSSI at the
time of the scan is not the highest among the set of observed
APs. Our protocol uses heading hints to aid such
decisions.

Figure 15: CDF of the ratio of number of handoffs using
the heading-aware handoff scheme to the standard
scheme calculated over the 30 testing tracks.

To train our protocol for a specific environment, we 
use the Android data collection application described 
earlier. Every second, this application logs the device’s 
heading along with a list of APs and their signal 
strengths. We use this data to compute a model that maps 
from a <heading, current AP> pair to a preferred AP, 
where the preferred AP is the AP to associate with when 
handing off from the current AP at the given heading to 
maximize connection time. 

Once trained, the protocol works as follows. If it de- 
tects the client is stationary it uses the standard scheme. 
If the protocol detects the client is walking, it extracts a 
heading hint. It then queries the model using this heading 
hint and its currently associated AP. The model looks up 
similar <heading, AP> pairs, and returns the AP that the 
client should associate with once the current AP’s signal 
strength drops below the handoff threshold. The model 
attempts to select the AP that will maximize connection 
time. If the AP returned by the model is not seen during 
the scan for handoff, the protocol defaults to the standard 
method of choosing the AP with the highest RSSI. 
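
The lookup-with-fallback step can be sketched as follows. This is only one plausible reading: the text does not specify how headings are grouped or how the preferred AP is derived from the training data, so the 45-degree sectors, the "longest mean connection time" rule, and the function names `sector`, `train`, and `choose_ap` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SECTOR_DEG = 45  # assumed heading quantization; not specified in the text


def sector(heading_deg: float) -> int:
    """Map a compass heading to a coarse sector index."""
    return int(heading_deg % 360 // SECTOR_DEG)


def train(samples: Iterable[Tuple[float, str, str, float]]) -> Dict[Tuple[int, str], str]:
    """Build the model from (heading_deg, current_ap, next_ap, connection_time_s).

    For each <heading sector, current AP> pair, prefer the next AP with the
    longest mean observed connection time (assumed selection rule).
    """
    times = defaultdict(lambda: defaultdict(list))
    for heading, cur, nxt, secs in samples:
        times[(sector(heading), cur)][nxt].append(secs)
    return {
        key: max(cands, key=lambda ap: sum(cands[ap]) / len(cands[ap]))
        for key, cands in times.items()
    }


def choose_ap(model: Dict[Tuple[int, str], str], heading_deg: float,
              current_ap: str, scan: Dict[str, float]) -> str:
    """Pick the handoff target; scan maps AP name to RSSI in dBm."""
    preferred = model.get((sector(heading_deg), current_ap))
    if preferred in scan:
        return preferred
    # Preferred AP unknown or not heard: fall back to the standard scheme.
    return max(scan, key=scan.get)
```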

To evaluate our protocol, we collected 60 tracks using 
several Android phones, walking through various MIT 
hallways. For each track, the user chose an arbitrary path 
in the building complex, and walked for 3 to 5 minutes.
We split the data into training and testing sets,
training on the former and testing on the latter.
Figure 15 shows that the number of handoffs in our scheme
is 40% lower than in the standard scheme. It also shows 
that for over 90% of the traces, our protocol yielded an 
improvement of at least 10%. 


5 TOPOLOGY MAINTENANCE 


In this section, we study the use of hints to improve the 
accuracy and efficiency of topology maintenance in wire- 
less mesh (and sensor) networks. Here, each node often 
maintains a list of neighbor nodes along with the quality 
of connectivity to each neighbor. The standard method 
for maintaining neighbors and link quality information in
this setting is for a node to send periodic probe packets. 
Each recipient maintains the packet loss rate of packets 
from its neighbor, and may exchange this information in 
the routing protocol’s messages. A key parameter is the 
probing rate: how often should these periodic messages 
be sent? In practice, a node may send these messages at 
more than one bit rate to produce link quality information 
at different bit rates. 

In determining the frequency of these probes, two op- 
posite considerations must be reconciled. On the one 
hand, sending frequent probes allows the nodes to main- 
tain an accurate estimate of link qualities and identify 
changing topologies. Maintaining accurate values for this 
metric is important to avoid packet losses, which can in- 
crease the number of retries and also incorrectly slow 
down the bit rate. On the other hand, frequent probe 
packets themselves use large chunks of the bandwidth 
and increase network contention. This tradeoff becomes 
even more acute in mobile settings, where link quality 
changes rapidly. For instance, Figure 16 (left) captures 
the channel behavior that we observed in a mixed
stationary/mobile setting. This plot shows the packet delivery
ratio over time for one specific trace, together with the
periods when the user is moving (derived from our movement
hint). To calculate
the delivery ratio, we bucketed time into intervals of 1 
second, during which time the sender transmits approxi- 
mately 200 packets at each bit rate. The key observation 
is that motion causes the packet delivery ratio to fluctu- 
ate, with many of the jumps in the delivery ratio exceed- 
ing 20%. 

Our idea is simple: because channel conditions vary 
much more in the presence of movement, probe fre- 
quently when a node receives movement hints from its 
neighbor or itself moves, and probe less often when the 
nodes are static. 


5.1 Measurement 


To evaluate the potential gains, we gathered data in an 
indoor environment where the sender was static and the 
receiver was either at a fixed location (stationary) or 
was moved at walking speeds (mobile). The sender sends 
probes at a rate of 200 probes per second. We calcu- 
late the actual delivery probability over a sliding win- 
dow of 10 packets from these rapidly sent probes, sub- 
sampling the outcome of these probes to determine the 
delivery probability at various lower probing rates. We 
collected 20 stationary and 20 mobile traces, each 180 
seconds long. We aggregate the results of the static cases 
into one set, and the mobile cases into another set. For 
each set, we calculate the error in the delivery proba- 
bility estimate, which depends on the probing rate, as 
|Observed probability - Actual probability|.
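
One way to read the sub-sampling procedure is sketched below. The details are assumptions: the text does not say exactly how the lower-rate estimate is windowed, so this sketch applies the same 10-sample window to both the full probe stream and the sub-sampled stream and averages the absolute difference at each sub-sampled probe. The function names are hypothetical, and a non-empty outcome list is assumed.

```python
from typing import List

WINDOW = 10  # sliding-window length in packets, as in the text


def window_mean(outcomes: List[int], end: int) -> float:
    """Mean of up to WINDOW outcomes ending at index `end` (inclusive)."""
    start = max(0, end - WINDOW + 1)
    chunk = outcomes[start:end + 1]
    return sum(chunk) / len(chunk)


def mean_estimation_error(outcomes: List[int], step: int) -> float:
    """Average |observed - actual| when only every `step`-th probe is kept.

    outcomes holds 1 (delivered) or 0 (lost) for each probe sent at the
    full rate; step=20 emulates 10 probes/s from a 200 probes/s trace.
    """
    sampled_idx = list(range(0, len(outcomes), step))
    sampled = [outcomes[i] for i in sampled_idx]
    errors = []
    for k, i in enumerate(sampled_idx):
        actual = window_mean(outcomes, i)    # from the full-rate stream
        observed = window_mean(sampled, k)   # from the sub-sampled stream
        errors.append(abs(observed - actual))
    return sum(errors) / len(errors)
```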

Figure 16 (middle) shows the average error in delivery
probability calculated from all the error samples for
the static case as a function of the probing rate; the error 
bars show the standard deviations. Even a low probing 
rate of 1 packet every 10 seconds has an error of only 
11%, suggesting that the default probing rate of many 
wireless networks of 1 probe/s may be too high. In con- 
trast, Figure 16 (right) shows that the error in delivery 
probability is much higher in the mobile case, exceed- 
ing 35% even at a probing rate of 1 packet every 2 sec- 
onds. To achieve an error of about 10%, the mobile case 
requires a probing rate of 5 probes per second, which 
is more than 25x higher than for the static case at the 
same error rate. For a desired error of 5%, the mobile 
case needs 10 probes/s, while the corresponding rate for 
the static case is 0.5 probes per second, a 20x difference. 


To understand the reason for this difference, consider a 
representative 25-second mobile trace in Figure 17 (left). 
The estimated probability does not track the actual prob- 
ability except at a high probe rate. This differs from what 
is observed in the static case. 


5.2 Hint-Aware Topology Maintenance Protocol 


We implemented a hint-driven topology maintenance 
protocol using rates of 1 and 10 probes per second for 
the stationary and mobile cases, respectively. The proto- 
col continues to send at the fast probe rate for one second 
after the node stops moving in order to estimate the cor- 
rect metric, before slowing the probe rate down. 
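
The probe-rate rule described above can be sketched as follows. The rates (1 and 10 probes per second) and the one-second hold come from the text; the class name `ProbeRateController` and its interface are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

STATIC_RATE_HZ = 1.0    # probes per second when stationary
MOBILE_RATE_HZ = 10.0   # probes per second when moving
HOLD_S = 1.0            # keep probing fast this long after movement stops


@dataclass
class ProbeRateController:
    last_moving_s: Optional[float] = None  # time of the most recent "moving" hint

    def rate(self, now_s: float, moving: bool) -> float:
        """Return the probing rate to use given the current movement hint."""
        if moving:
            self.last_moving_s = now_s
            return MOBILE_RATE_HZ
        recently_moved = (self.last_moving_s is not None
                          and now_s - self.last_moving_s <= HOLD_S)
        # Stay at the fast rate briefly so the estimate settles on the
        # new (static) link quality before slowing down.
        return MOBILE_RATE_HZ if recently_moved else STATIC_RATE_HZ
```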


Figure 17 (right) compares the performance of our 
protocol to the standard 1 probe/s protocol. We also
plot the movement hint, with a raised value indicating 
movement. Notice that our adaptive protocol maintains 
an accurate assessment of the actual delivery probabil- 
ity throughout the experiment, while the non-adaptive 1 
probe/s strategy lags by multiple seconds. Note that in 
some cases, the 1 probe/s approach mis-estimates the delivery
probability by more than 30%, whereas the adaptive
estimator is always within 5%.


A simple analysis shows how link mis-estimation degrades
throughput. Suppose a node uses ETX [5] to pick
the next hop. Suppose further that there are two choices,
one with link delivery probability $p_1$ and the other with
probability $p_2$; without loss of generality, let $p_1 > p_2$.
ETX would choose link 1, with metric $1/p_1$.

Let the error in the average link delivery probability
estimate be $\delta$ (Figure 16 (right) shows that $\delta > 0.25$
in some cases). The node would pick the wrong link if
$p_2 + \delta > p_1 - \delta$. In this case, the extra number of
transmissions relative to the optimal value is
$1/p_2 - 1/p_1$. The overhead relative to the optimal is
$p_1/p_2 - 1$, which can be large for practical parameters;
e.g., if $p_1 = 0.8$ and $p_2 = 0.6$, the throughput
reduction is 33%.
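
The arithmetic in this analysis can be checked with a few lines of code. This is a small worked example, with p1 and p2 the true delivery probabilities of the two links (p1 > p2) and delta the estimation error; the function names are introduced here for illustration only.

```python
def picks_wrong_link(p1: float, p2: float, delta: float) -> bool:
    """True if an error of delta can make link 2 look better than link 1."""
    return p2 + delta > p1 - delta


def extra_transmissions(p1: float, p2: float) -> float:
    """Expected extra transmissions per packet when link 2 is used instead of link 1."""
    return 1.0 / p2 - 1.0 / p1


def relative_overhead(p1: float, p2: float) -> float:
    """Extra transmissions as a fraction of the optimal count 1/p1."""
    return p1 / p2 - 1.0


if __name__ == "__main__":
    # Example from the text: p1 = 0.8, p2 = 0.6.
    print(picks_wrong_link(0.8, 0.6, 0.25))
    print(round(relative_overhead(0.8, 0.6), 2))
```

With these values the wrong link can be chosen whenever delta exceeds 0.1, which is well below the errors observed in the mobile case.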

Figure 16: Left: Packet delivery rate for 6 Mbps packets over time; the raised dashed hint line indicates the device is
moving. Middle & Right: Average error in delivery probability versus probing rate for static and mobile cases.

Figure 17: Delivery probability over time for the mobile trace (left) and for a combined static and mobile trace with
the dots showing the movement hint (right).


6 VEHICULAR NETWORK PATH SELECTION 


We now investigate whether hints can improve path se- 
lection in vehicular mesh networks. Networking strate- 
gies in this setting are complicated by dynamic neighbor 
tables, which can generate a high rate of broken paths. 
In general, broken paths increase overhead and latency. 
For this reason, selecting paths with the longest expected 
connection time may be a good idea. 


6.1 Connection Time Estimate Metrics 


We present three connection time estimate (CTE) metrics 
that use heading and speed hints to estimate whether a 
path in a vehicular network is likely to be long-lived. Let 
the ordered sequence $(u_1, \ldots, u_j)$ denote a $(j-1)$-hop path,
$dh(u_i, u_{i+1})$ denote the difference in heading of nodes $u_i$
and $u_{i+1}$ (from 0 to 180 degrees), and $s(u_i)$ denote the
speed of $u_i$ (in m/s). Our three CTE metrics, called cte1,
cte2, and cte3, are defined for a path $R = (u_1, \ldots, u_j)$:

$$\mathrm{cte1}(R) = \frac{1}{\prod_{u_i, u_{i+1} \in R} dh(u_i, u_{i+1})}$$

$$\mathrm{cte2}(R) = \min_{u_i, u_{i+1} \in R} \left( \frac{1}{dh(u_i, u_{i+1})} \right)$$

$$\mathrm{cte3}(R) = \frac{1}{1 + \sum_{u_i \in R} s(u_i)} \cdot \mathrm{cte1}(R)$$


The metrics use the assumption that a small differ- 
ence in heading indicates nodes are moving in the same 
direction on the same road, and are therefore likely to 
stay connected longer. The cte3 metric includes the additional
assumption that speed is inversely correlated with
connection duration. Because cte1 multiplies the inverse
of heading differences at each hop, it is biased toward
single-hop paths. The cte2 metric, by contrast, evaluates
a path only by its worst hop, scoring multi-hop paths
higher than cte1. The cte3 metric multiplies the inverse
of the sum of node speeds with the cte1 value. It follows,
for example, that doubling the speed of each node on a 
path approximately halves its cte3 score. 

To calculate these metrics, each node appends its head- 
ing and speed to its mesh neighbor probes. For all three, 
larger values predict longer-lived paths. These metrics 
are simple, and require no knowledge of road geometry. 
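
The three metrics can be computed directly from the per-node heading and speed carried in the probes. The sketch below makes two assumptions not stated in the text: heading differences are floored at 1 degree (`MIN_DH`) to avoid dividing by zero when two nodes share a heading, and a node is represented as a `(heading_deg, speed_mps)` tuple.

```python
from math import prod
from typing import List, Sequence, Tuple

Node = Tuple[float, float]  # (heading in degrees, speed in m/s)

MIN_DH = 1.0  # assumed floor on heading difference, to avoid division by zero


def heading_diff(h1: float, h2: float) -> float:
    """Difference in heading between two nodes, from 0 to 180 degrees."""
    d = abs(h1 - h2) % 360
    return min(d, 360 - d)


def hop_diffs(path: Sequence[Node]) -> List[float]:
    """Heading difference for each consecutive pair of nodes on the path."""
    return [max(MIN_DH, heading_diff(a[0], b[0])) for a, b in zip(path, path[1:])]


def cte1(path: Sequence[Node]) -> float:
    # Inverse of the product of per-hop heading differences.
    return 1.0 / prod(hop_diffs(path))


def cte2(path: Sequence[Node]) -> float:
    # Score of the worst hop only.
    return min(1.0 / d for d in hop_diffs(path))


def cte3(path: Sequence[Node]) -> float:
    # cte1 scaled by the inverse of one plus the sum of node speeds.
    return cte1(path) / (1.0 + sum(speed for _, speed in path))
```

For a two-hop path with heading differences of 2 and 4 degrees, cte1 is 1/8 while cte2 is 1/4, which illustrates why cte2 scores multi-hop paths higher than cte1 does.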


6.2 Evaluation 


We evaluated these metrics over a collection of vehicu- 
lar mobility traces generated from raw position samples 
gathered from vehicles in the CarTel project over the du- 
ration of a year in the Boston metro area, map-matched to 
an underlying road network [23]. We combine a collec- 
tion of traces into a network, and then simulate, for each 
second, the position of every vehicle in the network, ad- 
justing the traces so they all appear to begin at the same 
time. We consider two vehicles to have a link at a given 
time if and only if they are within 100 meters at that time 
in their traces (we use geographic proximity as a crude 
surrogate for connectivity). 

We measured the relationship between CTE values 
and path duration over both one and two hop paths. 
Specifically, we studied 15 networks consisting of 100
vehicles each. Each simulation lasted for 120 seconds. 
For each of the over 190,000 routes observed in these 
simulations, we calculated all three CTE metrics when 
the path is first formed, and the total duration of the path 
(in seconds) before it breaks. For each metric, we bucket 
the CTE scores (into buckets of size 1/20 for cte1, 1/10
for cte2, and 1/200 for cte3), and calculate the median 
link duration of the paths in each bucket. In Figure 18 
we plot these durations for the first three buckets (in de- 
scending order of associated CTE score) for each CTE 
metric. The dashed line indicates the median duration 
over all paths. 
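
The bucketing step can be sketched as below. This assumes fixed-width buckets indexed by floor(score / bucket size), which is one plausible reading of "buckets of size 1/20"; the function name and the input layout of (score, duration) pairs are introduced here for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, List, Tuple


def median_duration_by_bucket(paths: Iterable[Tuple[float, float]],
                              bucket_size: float) -> List[Tuple[int, float]]:
    """Group (cte_score, duration_s) pairs into score buckets.

    Returns (bucket_index, median_duration) pairs in descending order of
    bucket index, so the first entries are the highest-scoring buckets.
    """
    buckets = defaultdict(list)
    for score, duration in paths:
        buckets[int(score // bucket_size)].append(duration)
    return [(b, median(buckets[b])) for b in sorted(buckets, reverse=True)]
```

Taking the first three entries of the result corresponds to the buckets plotted in Figure 18; scores that fall exactly on a bucket edge may land in either neighbor because of floating-point rounding.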


The figure shows that all three CTE metrics provide an 
effective filter for long-lived paths. If a path’s cte1 value
falls into the first bucket, or if its cte2 or cte3 values fall
into the first two buckets, then the path is likely to be
long-lived. The median duration of paths in these buckets
is 2-5x longer than the median over all paths.


Identifying long-lived paths might not be a good strat- 
egy if the selection mechanism is somehow biased to- 
ward routes with low throughput. To evaluate this pos- 
sibility, we use distance as a rough approximation of 
achievable throughput (we only have position data from 
the networks used in this evaluation). We plot in Figure 19 
the CDF of time versus distance for the single-hop 
paths in the first bucket of cte1, and the first two buckets 
of cte2 and cte3. For comparison we also plot the function 
for all single-hop paths. This figure confirms that 
our CTE metrics show no bias favoring links of larger 
distances (lower throughput). 


7 LIMITATIONS 


Energy. Sampling sensors consumes energy and reduces 
the battery lifetime of a mobile device. Figure 20 shows 
the battery lifetime of an Android G1 device when vari- 
ous hints are sampled at the highest supported rates. No- 
tice that the accelerometer and compass consume much 
less energy than GPS. To alleviate energy concerns, pro- 
tocols should extract hints only when transferring data. 
Moreover, sensors like the accelerometer on a mobile 
device are usually always on by default (for instance, 
to continuously detect changes in screen orientation), so 
extracting hints from them should consume no extra en- 
ergy. Triggered sensing [10] can further reduce the en- 
ergy consumed by some sensors. Here, a low-power sen- 
sor turns on or off a high-power sensor based on certain 
events; for example, GPS can be turned on only when 
the accelerometer detects movement. We can also dy- 
namically reduce the sampling rate of sensors to reduce 
the energy cost [22, 23, 24], and replace expensive GPS 
with lower-energy position sensors like GSM radios, as 
in CTrack [24]. In addition, sensor hints can be turned 
off when the battery is low and protocols can revert to a 
hint-unaware scheme. 
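
As a rough sketch of the triggered-sensing idea mentioned above, a low-power accelerometer reading can gate the GPS. The function name, the input format, and the 0.5 m/s^2 threshold are assumptions made for illustration; they are not part of the Sensor Library or of [10].

```python
def gps_should_be_on(accel_magnitudes, threshold=0.5):
    """Decide whether to power the GPS from recent accelerometer readings.

    accel_magnitudes: recent acceleration magnitudes in m/s^2 (hypothetical input).
    threshold: spread above which the device is treated as moving (made-up value).
    """
    if len(accel_magnitudes) < 2:
        return False  # not enough samples to detect movement
    return max(accel_magnitudes) - min(accel_magnitudes) > threshold


stationary = [9.80, 9.81, 9.79]   # readings dominated by gravity, little spread
moving = [9.2, 10.9, 8.7]         # large spread suggests movement
```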

Calibration across devices. The disparity between 
sensors across different devices and platforms might 
pose a challenge for hint-aware protocols to work with- 
out sensor calibration and tuning. We have implemented 
the Sensor Library for Android Nexus One, Android 
G1, and iPhones. The movement hint worked seamlessly 
across these platforms, but the walking hint detector [22] 
required a little tuning for each type of device. 

Privacy. Sharing mobility hints with other nodes 
might expose private information. For instance, by con- 
tinuously monitoring movement and heading hints, it 
might be possible to track a user’s behavior more accu- 
rately than by just monitoring wireless packets from a 
device (e.g., I might be able to determine more reliably 
that you left your office because of the movement hints 
broadcast by your device); one might alleviate this prob- 
lem by having all communication go via a (trusted) AP, 
and encrypting the hints sent to the AP. 


8 RELATED WORK 


To the best of our knowledge, ours is the first practi- 
cal work to explore the benefits of systematically inte- 
grating sensor hints into a wireless network architecture. 
Related work that uses information outside the wireless 
networking stack has mostly focused on wireless power 
saving. For instance, Wake on Wireless [17] uses an ad- 
ditional low power radio that can be used for signaling to 
wake up the wireless radio. Cell2Notify [1] uses the cellular 
radio on a smartphone to wake up the WiFi interface 
for VoIP calls thus reducing the energy consumption of 
WiFi. BlueFi [2] uses GSM towers and nearby Bluetooth 
devices to predict if WiFi connectivity is available, hence 
achieving power savings. 

Beyond power savings, hints from external sensors have 
also been used in wireless protocols, mostly in 
vehicular network designs. Mobisteer [12] uses directional 
antennas in vehicles and location hints from GPS 
to find the best antenna orientation and the AP to asso- 
ciate with. Breadcrumbs [13] predicts the best AP to as- 
sociate with using a mobility model built using GPS co- 
ordinates. CARS [16] is an inter-vehicle bit rate adap- 
tation protocol that uses knowledge of the speed and 
distance between communicating cars to pick a bit rate. 
Their method collects a large amount of training data for 
an environment to determine the best bit rate to use at dif- 
ferent speeds and distances; in contrast our hint-aware bit 
rate adaptation method does not require any such training 
and performs well across a variety of conditions. 


9 CONCLUSION 


This paper introduced a network architecture that uses 
sensor hints to augment and improve wireless protocols. 
The key idea is to use these hints to infer the context in 
which communication is occurring, and to use that context 
to adapt the behavior of protocols. We applied this 
idea to develop hint-aware protocols for bit rate adaptation, 
access point association, topology maintenance, and 
path selection in vehicular networks. Sensor hints can 
also augment other protocols, as described in our earlier 
position paper [15]. These include: adapting the length of 
the cyclic prefix to outdoor speeds, scheduling client traffic 
at an AP taking movement into account, preemptively 
disassociating clients that have likely moved beyond the 
range of an AP, and network monitoring. 

Figure 18: The median route duration for the highest three 
CTE value buckets. The dashed line is the median over all routes. 


Abstract 


We present Mesos, a platform for sharing commod- 
ity clusters between multiple diverse cluster computing 
frameworks, such as Hadoop and MPI. Sharing improves 
cluster utilization and avoids per-framework data repli- 
cation. Mesos shares resources in a fine-grained man- 
ner, allowing frameworks to achieve data locality by 
taking turns reading data stored on each machine. To 
support the sophisticated schedulers of today’s frame- 
works, Mesos introduces a distributed two-level schedul- 
ing mechanism called resource offers. Mesos decides 
how many resources to offer each framework, while 
frameworks decide which resources to accept and which 
computations to run on them. Our results show that 
Mesos can achieve near-optimal data locality when shar- 
ing the cluster among diverse frameworks, can scale to 
50,000 (emulated) nodes, and is resilient to failures. 


1 Introduction 


Clusters of commodity servers have become a major 
computing platform, powering both large Internet ser- 
vices and a growing number of data-intensive scientific 
applications. Driven by these applications, researchers 
and practitioners have been developing a diverse array of 
cluster computing frameworks to simplify programming 
the cluster. Prominent examples include MapReduce 
[18], Dryad [24], MapReduce Online [17] (which sup- 
ports streaming jobs), Pregel [28] (a specialized frame- 
work for graph computations), and others [27, 19, 30]. 

It seems clear that new cluster computing frameworks¹ 
will continue to emerge, and that no framework will be 
optimal for all applications. Therefore, organizations 
will want to run multiple frameworks in the same cluster, 
picking the best one for each application. Multiplexing 
a cluster between frameworks improves utilization and 
allows applications to share access to large datasets that 
may be too costly to replicate across clusters. 


¹By “framework,” we mean a software system that manages and 
executes one or more jobs on a cluster. 


Two common solutions for sharing a cluster today are 
either to statically partition the cluster and run one frame- 
work per partition, or to allocate a set of VMs to each 
framework. Unfortunately, these solutions achieve nei- 
ther high utilization nor efficient data sharing. The main 
problem is the mismatch between the allocation granular- 
ities of these solutions and of existing frameworks. Many 
frameworks, such as Hadoop and Dryad, employ a fine- 
grained resource sharing model, where nodes are subdi- 
vided into “slots” and jobs are composed of short tasks 
that are matched to slots [25, 38]. The short duration of 
tasks and the ability to run multiple tasks per node allow 
jobs to achieve high data locality, as each job will quickly 
get a chance to run on nodes storing its input data. Short 
tasks also allow frameworks to achieve high utilization, 
as jobs can rapidly scale when new nodes become avail- 
able. Unfortunately, because these frameworks are de- 
veloped independently, there is no way to perform fine- 
grained sharing across frameworks, making it difficult to 
share clusters and data efficiently between them. 

In this paper, we propose Mesos, a thin resource shar- 
ing layer that enables fine-grained sharing across diverse 
cluster computing frameworks, by giving frameworks a 
common interface for accessing cluster resources. 

The main design question for Mesos is how to build 
a scalable and efficient system that supports a wide ar- 
ray of both current and future frameworks. This is chal- 
lenging for several reasons. First, each framework will 
have different scheduling needs, based on its program- 
ming model, communication pattern, task dependencies, 
and data placement. Second, the scheduling system must 
scale to clusters of tens of thousands of nodes running 
hundreds of jobs with millions of tasks. Finally, because 
all the applications in the cluster depend on Mesos, the 
system must be fault-tolerant and highly available. 

One approach would be for Mesos to implement a cen- 
tralized scheduler that takes as input framework require- 
ments, resource availability, and organizational policies, 
and computes a global schedule for all tasks. While this 
approach can optimize scheduling across frameworks, it 
faces several challenges. The first is complexity. The 
scheduler would need to provide a sufficiently expres- 
sive API to capture all frameworks’ requirements, and 
to solve an online optimization problem for millions 
of tasks. Even if such a scheduler were feasible, this 
complexity would have a negative impact on its scala- 
bility and resilience. Second, as new frameworks and 
new scheduling policies for current frameworks are con- 
stantly being developed [37, 38, 40, 26], it is not clear 
whether we are even at the point to have a full specifi- 
cation of framework requirements. Third, many existing 
frameworks implement their own sophisticated schedul- 
ing [25, 38], and moving this functionality to a global 
scheduler would require expensive refactoring. 

Instead, Mesos takes a different approach: delegating 
control over scheduling to the frameworks. This is ac- 
complished through a new abstraction, called a resource 
offer, which encapsulates a bundle of resources that a 
framework can allocate on a cluster node to run tasks. 
Mesos decides how many resources to offer each frame- 
work, based on an organizational policy such as fair shar- 
ing, while frameworks decide which resources to accept 
and which tasks to run on them. While this decentral- 
ized scheduling model may not always lead to globally 
optimal scheduling, we have found that it performs sur- 
prisingly well in practice, allowing frameworks to meet 
goals such as data locality nearly perfectly. In addition, 
resource offers are simple and efficient to implement, al- 
lowing Mesos to be highly scalable and robust to failures. 


Mesos also provides other benefits to practitioners. 
First, even organizations that only use one framework 
can use Mesos to run multiple instances of that frame- 
work in the same cluster, or multiple versions of the 
framework. Our contacts at Yahoo! and Facebook in- 
dicate that this would be a compelling way to isolate 
production and experimental Hadoop workloads and to 
roll out new versions of Hadoop [11, 10]. Second, 
Mesos makes it easier to develop and immediately ex- 
periment with new frameworks. The ability to share re- 
sources across multiple frameworks frees the developers 
to build and run specialized frameworks targeted at par- 
ticular problem domains rather than one-size-fits-all ab- 
stractions. Frameworks can therefore evolve faster and 
provide better support for each problem domain. 

We have implemented Mesos in 10,000 lines of C++. 
The system scales to 50,000 (emulated) nodes and uses 
ZooKeeper [4] for fault tolerance. To evaluate Mesos, we 
have ported three cluster computing systems to run over 
it: Hadoop, MPI, and the Torque batch scheduler. To val- 
idate our hypothesis that specialized frameworks provide 
value over general ones, we have also built a new framework 
on top of Mesos called Spark, optimized for iterative 
jobs where a dataset is reused in many parallel operations, 
and shown that Spark can outperform Hadoop by 
10x in iterative machine learning workloads. 

Figure 1: CDF of job and task durations in Facebook’s Hadoop 
data warehouse (data from [38]). 


This paper is organized as follows. Section 2 details 
the data center environment that Mesos is designed for. 
Section 3 presents the architecture of Mesos. Section 4 
analyzes our distributed scheduling model (resource of- 
fers) and characterizes the environments that it works 
well in. We present our implementation of Mesos in Sec- 
tion 5 and evaluate it in Section 6. We survey related 
work in Section 7. Finally, we conclude in Section 8. 


2 Target Environment 


As an example of a workload we aim to support, con- 
sider the Hadoop data warehouse at Facebook [5]. Face- 
book loads logs from its web services into a 2000-node 
Hadoop cluster, where they are used for applications 
such as business intelligence, spam detection, and ad 
optimization. In addition to “production” jobs that run 
periodically, the cluster is used for many experimental 
jobs, ranging from multi-hour machine learning compu- 
tations to 1-2 minute ad-hoc queries submitted interac- 
tively through an SQL interface called Hive [3]. Most 
jobs are short (the median job being 84s long), and the 
jobs are composed of fine-grained map and reduce tasks 
(the median task being 23s), as shown in Figure 1. 


To meet the performance requirements of these jobs, 
Facebook uses a fair scheduler for Hadoop that takes ad- 
vantage of the fine-grained nature of the workload to al- 
locate resources at the level of tasks and to optimize data 
locality [38]. Unfortunately, this means that the cluster 
can only run Hadoop jobs. If a user wishes to write an ad 
targeting algorithm in MPI instead of MapReduce, per- 
haps because MPI is more efficient for this job’s commu- 
nication pattern, then the user must set up a separate MPI 
cluster and import terabytes of data into it. This problem 
is not hypothetical; our contacts at Yahoo! and Facebook 
report that users want to run MPI and MapReduce Online 
(a streaming MapReduce) [11, 10]. Mesos aims to provide 
fine-grained sharing between multiple cluster computing 
frameworks to enable these usage scenarios. 

Figure 2: Mesos architecture diagram, showing two running 
frameworks (Hadoop and MPI). 





3 Architecture 


We begin our description of Mesos by discussing our de- 
sign philosophy. We then describe the components of 
Mesos, our resource allocation mechanisms, and how 
Mesos achieves isolation, scalability, and fault tolerance. 


3.1 Design Philosophy 


Mesos aims to provide a scalable and resilient core for 
enabling various frameworks to efficiently share clusters. 
Because cluster frameworks are both highly diverse and 
rapidly evolving, our overriding design philosophy has 
been to define a minimal interface that enables efficient 
resource sharing across frameworks, and otherwise push 
control of task scheduling and execution to the frame- 
works. Pushing control to the frameworks has two bene- 
fits. First, it allows frameworks to implement diverse ap- 
proaches to various problems in the cluster (e.g., achiev- 
ing data locality, dealing with faults), and to evolve these 
solutions independently. Second, it keeps Mesos simple 
and minimizes the rate of change required of the system, 
which makes it easier to keep Mesos scalable and robust. 
Although Mesos provides a low-level interface, we ex- 
pect higher-level libraries implementing common func- 
tionality (such as fault tolerance) to be built on top of 
it. These libraries would be analogous to library OSes in 
the exokernel [20]. Putting this functionality in libraries 
rather than in Mesos allows Mesos to remain small and 
flexible, and lets the libraries evolve independently. 


3.2 Overview 


Figure 2 shows the main components of Mesos. Mesos 
consists of a master process that manages slave daemons 
running on each cluster node, and frameworks that run 
tasks on these slaves. 

The master implements fine-grained sharing across 
frameworks using resource offers. Each resource offer 
is a list of free resources on multiple slaves. The master 
decides how many resources to offer to each framework 
according to an organizational policy, such as fair sharing 
or priority. To support a diverse set of inter-framework 
allocation policies, Mesos lets organizations define their 
own policies via a pluggable allocation module. 

Figure 3: Resource offer example. 

Each framework running on Mesos consists of two 
components: a scheduler that registers with the master 
to be offered resources, and an executor process that is 
launched on slave nodes to run the framework’s tasks. 
While the master determines how many resources to of- 
fer to each framework, the frameworks’ schedulers select 
which of the offered resources to use. When a framework 
accepts offered resources, it passes Mesos a description 
of the tasks it wants to launch on them. 

Figure 3 shows an example of how a framework gets 
scheduled to run tasks. In step (1), slave 1 reports 
to the master that it has 4 CPUs and 4 GB of mem- 
ory free. The master then invokes the allocation mod- 
ule, which tells it that framework 1 should be offered 
all available resources. In step (2), the master sends a 
resource offer describing these resources to framework 
1. In step (3), the framework’s scheduler replies to the 
master with information about two tasks to run on the 
slave, using (2 CPUs, 1 GB RAM) for the first task, and 
(1 CPU, 2 GB RAM) for the second task. Finally, in 
step (4), the master sends the tasks to the slave, which al- 
locates appropriate resources to the framework’s execu- 
tor, which in turn launches the two tasks (depicted with 
dotted borders). Because 1 CPU and 1 GB of RAM are 
still free, the allocation module may now offer them to 
framework 2. In addition, this resource offer process re- 
peats when tasks finish and new resources become free. 
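
The arithmetic of this example can be checked with a small sketch. The `Resources` type and `launch` helper below are hypothetical names introduced for illustration; they model only the bookkeeping of the four steps, not the actual Mesos implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int

    def __sub__(self, other):
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)

    def fits(self, other):
        """True if `other` fits within these resources."""
        return other.cpus <= self.cpus and other.mem_gb <= self.mem_gb


def launch(offer, tasks):
    """Subtract each task's demand from the offer and return what stays free."""
    free = offer
    for task in tasks:
        if not free.fits(task):
            raise ValueError("task does not fit in the remaining offer")
        free = free - task
    return free


offer = Resources(cpus=4, mem_gb=4)          # steps (1)-(2): slave 1's free resources
tasks = [Resources(2, 1), Resources(1, 2)]   # step (3): framework 1's two tasks
leftover = launch(offer, tasks)              # step (4): what remains on the slave
```

The leftover bundle (1 CPU, 1 GB) is what the allocation module may then offer to framework 2.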

To maintain a thin interface and enable frameworks 
to evolve independently, Mesos does not require frame- 
works to specify their resource requirements or con- 
straints. Instead, Mesos gives frameworks the ability to 
reject offers. A framework can reject resources that do 
not satisfy its constraints in order to wait for ones that 
do. Thus, the rejection mechanism enables frameworks 
to support arbitrarily complex resource constraints while 
keeping Mesos simple and scalable. 

One potential challenge with solely using the rejection 
mechanism to satisfy all framework constraints is 
efficiency: a framework may have to wait a long time 
before it receives an offer satisfying its constraints, and 
Mesos may have to send an offer to many frameworks 
before one of them accepts it. To avoid this, Mesos also 
allows frameworks to set filters, which are Boolean pred- 
icates specifying that a framework will always reject cer- 
tain resources. For example, a framework might specify 
a whitelist of nodes it can run on. 

There are two points worth noting. First, filters repre- 
sent just a performance optimization for the resource of- 
fer model, as the frameworks still have the ultimate con- 
trol to reject any resources that they cannot express filters 
for and to choose which tasks to run on each node. Sec- 
ond, as we will show in this paper, when the workload 
consists of fine-grained tasks (e.g., in MapReduce and 
Dryad workloads), the resource offer model performs 
surprisingly well even in the absence of filters. In par- 
ticular, we have found that a simple policy called delay 
scheduling [38], in which frameworks wait for a limited 
time to acquire nodes storing their data, yields nearly op- 
timal data locality with a wait time of 1-5s. 
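
The decision a framework makes under such a policy can be sketched as below. This is a simplified illustration of the waiting rule only, assuming a single time budget per job; the function and parameter names are made up, and the algorithm in [38] has more detail than is shown here.

```python
def accept_offer(offered_node, preferred_nodes, waited_s, max_wait_s=5.0):
    """Decide whether to accept an offered node under a simple delay rule.

    offered_node: node named in the resource offer.
    preferred_nodes: nodes that store the job's input data.
    waited_s: how long the job has been waiting for a local node so far.
    max_wait_s: wait budget; 5.0 is the upper end of the 1-5s range in the text.
    """
    if offered_node in preferred_nodes:
        return True                    # data-local: always accept
    return waited_s >= max_wait_s      # non-local: accept only after waiting long enough


local_nodes = {"node1", "node7"}
```

Rejecting a non-local offer costs little when tasks are short, because another slot is likely to free up on a preferred node soon.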

In the rest of this section, we describe how Mesos performs 
two key functions: resource allocation (§3.3) and 
resource isolation (§3.4). We then describe filters and 
several other mechanisms that make resource offers scalable 
and robust (§3.5). Finally, we discuss fault tolerance 
in Mesos (§3.6) and summarize the Mesos API (§3.7). 


3.3 Resource Allocation 


Mesos delegates allocation decisions to a pluggable al- 
location module, so that organizations can tailor alloca- 
tion to their needs. So far, we have implemented two 
allocation modules: one that performs fair sharing based 
on a generalization of max-min fairness for multiple re- 
sources [21] and one that implements strict priorities. 
Similar policies are used in Hadoop and Dryad [25, 38]. 

In normal operation, Mesos takes advantage of the 
fact that most tasks are short, and only reallocates re- 
sources when tasks finish. This usually happens fre- 
quently enough so that new frameworks acquire their 
share quickly. For example, if a framework’s share is 
10% of the cluster, it needs to wait approximately 10% 
of the mean task length to receive its share. However, 
if a cluster becomes filled by long tasks, e.g., due to a 
buggy job or a greedy framework, the allocation module 
can also revoke (kill) tasks. Before killing a task, Mesos 
gives its framework a grace period to clean it up. 
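
A rough way to see the 10% figure is the following back-of-the-envelope estimate. It assumes (as a simplification not stated in the text) that all n slots are busy with tasks of mean length T whose finish times are evenly spread, so that slots free up at an aggregate rate of about n/T. Collecting a share s of the cluster then takes approximately:

```latex
t_{\text{ramp}} \;\approx\; \frac{s\,n}{n/T} \;=\; s\,T,
\qquad\text{so } s = 0.1 \;\Rightarrow\; t_{\text{ramp}} \approx 0.1\,T .
```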

We leave it up to the allocation module to select the 
policy for revoking tasks, but describe two related mech- 
anisms here. First, while killing a task has a low impact 
on many frameworks (e.g., MapReduce), it is harmful for 
frameworks with interdependent tasks (e.g., MPI). We allow 
these frameworks to avoid being killed by letting allocation 
modules expose a guaranteed allocation to each 
framework—a quantity of resources that the framework 
may hold without losing tasks. Frameworks read their 
guaranteed allocations through an API call. Allocation 
modules are responsible for ensuring that the guaranteed 
allocations they provide can all be met concurrently. For 
now, we have kept the semantics of guaranteed alloca- 
tions simple: if a framework is below its guaranteed al- 
location, none of its tasks should be killed, and if it is 
above, any of its tasks may be killed. 

Second, to decide when to trigger revocation, Mesos 
must know which of the connected frameworks would 
use more resources if they were offered them. Frame- 
works indicate their interest in offers through an API call. 


3.4 Isolation 


Mesos provides performance isolation between frame- 
work executors running on the same slave by leveraging 
existing OS isolation mechanisms. Since these mecha- 
nisms are platform-dependent, we support multiple iso- 
lation mechanisms through pluggable isolation modules. 
We currently isolate resources using OS container 
technologies, specifically Linux Containers [9] and So- 
laris Projects [13]. These technologies can limit the 
CPU, memory, network bandwidth, and (in new Linux 
kernels) I/O usage of a process tree. These isolation tech- 
nologies are not perfect, but using containers is already 
an advantage over frameworks like Hadoop, where tasks 
from different jobs simply run in separate processes. 


3.5 Making Resource Offers Scalable and Robust 


Because task scheduling in Mesos is a distributed pro- 
cess, it needs to be efficient and robust to failures. Mesos 
includes three mechanisms to help with this goal. 

First, because some frameworks will always reject cer- 
tain resources, Mesos lets them short-circuit the rejection 
process and avoid communication by providing filters to 
the master. We currently support two types of filters: 
“only offer nodes from list L” and “only offer nodes with 
at least R resources free”. However, other types of predicates 
could also be supported. Note that unlike generic 
constraint languages, filters are Boolean predicates that 
specify whether a framework will reject one bundle of 
resources on one node, so they can be evaluated quickly 
on the master. Any resource that does not pass a frame- 
work’s filter is treated exactly like a rejected resource. 
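
The two filter types can be sketched as plain Boolean predicates over a single bundle of resources on one node. The helper names and the (node, cpus, mem_gb) signature below are assumptions made for illustration, not the Mesos filter interface.

```python
def only_nodes_from(allowed):
    """Filter of the form "only offer nodes from list L"."""
    allowed = frozenset(allowed)
    return lambda node, cpus, mem_gb: node in allowed


def at_least(min_cpus, min_mem_gb):
    """Filter of the form "only offer nodes with at least R resources free"."""
    return lambda node, cpus, mem_gb: cpus >= min_cpus and mem_gb >= min_mem_gb


def passes(filters, node, cpus, mem_gb):
    """A bundle is worth offering only if every filter accepts it."""
    return all(f(node, cpus, mem_gb) for f in filters)


# Hypothetical framework: runs only on slave1/slave2 and needs 2 CPUs and 1 GB.
filters = [only_nodes_from(["slave1", "slave2"]), at_least(2, 1)]
```

Because each predicate looks at one bundle on one node, the master can evaluate it in constant time per offer, which is the property the text contrasts with generic constraint languages.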

Second, because a framework may take time to re- 
spond to an offer, Mesos counts resources offered to a 
framework towards its allocation of the cluster. This is 
a strong incentive for frameworks to respond to offers 
quickly and to filter resources that they cannot use. 

Third, if a framework has not responded to an offer 
for a sufficiently long time, Mesos rescinds the offer and 
re-offers the resources to other frameworks. 

Scheduler Callbacks               Scheduler Actions 
resourceOffer(offerId, offers)    replyToOffer(offerId, tasks) 
offerRescinded(offerId)           setNeedsOffers(bool) 
statusUpdate(taskId, status)      setFilters(filters) 
slaveLost(slaveId)                getGuaranteedShare() 
                                  killTask(taskId) 

Executor Callbacks                Executor Actions 
launchTask(taskDescriptor)        sendStatus(taskId, status) 
killTask(taskId) 

Table 1: Mesos API functions for schedulers and executors. 
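
The second and third mechanisms of Section 3.5 (charging outstanding offers to a framework's allocation, and rescinding offers that go unanswered) can be sketched together. The `OfferTracker` class, its method names, and the use of CPUs as the only resource are assumptions for illustration; the text does not specify the timeout value or the master's data structures.

```python
class OfferTracker:
    """Tracks outstanding offers for a hypothetical master."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.pending = {}  # offer_id -> (framework, cpus, time_sent)

    def offer(self, offer_id, framework, cpus, now):
        self.pending[offer_id] = (framework, cpus, now)

    def charged_cpus(self, framework, running_cpus):
        """Running CPUs plus CPUs in unanswered offers count toward the allocation."""
        offered = sum(c for f, c, _ in self.pending.values() if f == framework)
        return running_cpus + offered

    def rescind_expired(self, now):
        """Drop offers older than the timeout and return their ids for re-offering."""
        expired = [oid for oid, (_, _, sent) in self.pending.items()
                   if now - sent >= self.timeout_s]
        for oid in expired:
            del self.pending[oid]
        return expired


tracker = OfferTracker(timeout_s=10)
tracker.offer("o1", "fw1", cpus=4, now=0)
```

While the offer is outstanding, "fw1" is charged for the 4 offered CPUs in addition to whatever it is running, which is the incentive to answer quickly.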


3.6 Fault Tolerance 


Since all the frameworks depend on the Mesos master, it 
is critical to make the master fault-tolerant. To achieve 
this, we have designed the master to be soft state, so that 
a new master can completely reconstruct its internal state 
from information held by the slaves and the framework 
schedulers. In particular, the master’s only state is the list 
of active slaves, active frameworks, and running tasks. 
This information is sufficient to compute how many re- 
sources each framework is using and run the allocation 
policy. We run multiple masters in a hot-standby config- 
uration using ZooKeeper [4] for leader election. When 
the active master fails, the slaves and schedulers connect 
to the next elected master and repopulate its state. 
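
Because the master's state is soft, a newly elected master can recompute per-framework usage from what the slaves report. The sketch below assumes each slave re-registers with a list of (framework, cpus) entries for its running tasks; that report format and the function name are hypothetical.

```python
from collections import Counter


def rebuild_usage(slave_reports):
    """Recompute how many CPUs each framework is using from slave reports.

    slave_reports: {slave_id: [(framework_id, cpus), ...]} for running tasks.
    """
    usage = Counter()
    for tasks in slave_reports.values():
        for framework_id, cpus in tasks:
            usage[framework_id] += cpus
    return dict(usage)


# Made-up reports from two slaves reconnecting to the new master.
reports = {
    "slave1": [("hadoop", 2), ("mpi", 1)],
    "slave2": [("hadoop", 3)],
}
usage = rebuild_usage(reports)
```

The resulting totals are the input the allocation policy needs, so no master-side log has to be replayed.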

Aside from handling master failures, Mesos reports 
node failures and executor crashes to frameworks’ sched- 
ulers. Frameworks can then react to these failures using 
the policies of their choice. 

Finally, to deal with scheduler failures, Mesos allows a 
framework to register multiple schedulers such that when 
one fails, another one is notified by the Mesos master to 
take over. Frameworks must use their own mechanisms 
to share state between their schedulers. 


3.7 API Summary 


Table 1 summarizes the Mesos API. The “callback” 
columns list functions that frameworks must implement, 
while “actions” are operations that they can invoke. 
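
To make the callback/action split concrete, here is a toy Python rendering of a scheduler that uses the callback names from Table 1. It is illustrative only: the `RecordingDriver` stand-in, the class names, the "FINISHED" status string, and the convention of replying with an empty task list to decline an offer are all assumptions, not the actual Mesos bindings.

```python
class RecordingDriver:
    """Stand-in for the scheduler actions; it only records the calls made."""

    def __init__(self):
        self.replies = []

    def replyToOffer(self, offerId, tasks):
        self.replies.append((offerId, tasks))


class OneTaskPerOfferScheduler:
    """Launches at most one pending task per offer."""

    def __init__(self, driver, pending):
        self.driver = driver
        self.pending = list(pending)
        self.done = []

    def resourceOffer(self, offerId, offers):
        # Assumed convention: an empty task list declines the offer.
        tasks = [self.pending.pop(0)] if self.pending and offers else []
        self.driver.replyToOffer(offerId, tasks)

    def offerRescinded(self, offerId):
        pass  # nothing was launched on a rescinded offer

    def statusUpdate(self, taskId, status):
        if status == "FINISHED":
            self.done.append(taskId)

    def slaveLost(self, slaveId):
        pass  # a real framework would reschedule the lost tasks


driver = RecordingDriver()
scheduler = OneTaskPerOfferScheduler(driver, pending=["t1"])
scheduler.resourceOffer("o1", ["slave1"])   # accepted: launches t1
scheduler.resourceOffer("o2", ["slave1"])   # declined: nothing left to run
scheduler.statusUpdate("t1", "FINISHED")
```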


4 Mesos Behavior 


In this section, we study Mesos’s behavior for different 
workloads. Our goal is not to develop an exact model of 
the system, but to provide a coarse understanding of its 
behavior, in order to characterize the environments that 
Mesos’s distributed scheduling model works well in. 

In short, we find that Mesos performs very well when 
frameworks can scale up and down elastically, task 
durations are homogeneous, and frameworks prefer all 
nodes equally (§4.2). When different frameworks prefer 
different nodes, we show that Mesos can emulate a 
centralized scheduler that performs fair sharing across 
frameworks (§4.3). In addition, we show that Mesos can 
handle heterogeneous task durations without impacting 
the performance of frameworks with short tasks (§4.4). 
We also discuss how frameworks are incentivized to improve 
their performance under Mesos, and argue that 
these incentives also improve overall cluster utilization 
(§4.5). We conclude this section with some limitations 
of Mesos’s distributed scheduling model (§4.6). 


4.1 Definitions, Metrics and Assumptions 


In our discussion, we consider three metrics: 


• Framework ramp-up time: time it takes a new 
  framework to achieve its allocation (e.g., fair share); 

• Job completion time: time it takes a job to complete, 
  assuming one job per framework; 

• System utilization: total cluster utilization. 


We characterize workloads along two dimensions: elas- 
ticity and task duration distribution. An elastic frame- 
work, such as Hadoop and Dryad, can scale its resources 
up and down, i.e., it can start using nodes as soon as it 
acquires them and release them as soon as its tasks finish. In 
contrast, a rigid framework, such as MPI, can start run- 
ning its jobs only after it has acquired a fixed quantity of 
resources, and cannot scale up dynamically to take ad- 
vantage of new resources or scale down without a large 
impact on performance. For task durations, we consider 
both homogeneous and heterogeneous distributions. 

We also differentiate between two types of resources: 
mandatory and preferred. A resource is mandatory if a 
framework must acquire it in order to run. For example, a 
graphics processing unit (GPU) is mandatory if a framework 
cannot run without access to a GPU. In contrast, a resource 
is preferred if a framework performs “better” using 
it, but can also run using another equivalent resource. 
For example, a framework may prefer running on a node 
that locally stores its data, but may also be able to read 
the data remotely if it must. 

We assume the amount of mandatory resources re- 
quested by a framework never exceeds its guaranteed 
share. This ensures that frameworks will not deadlock 
waiting for the mandatory resources to become free.² For 
simplicity, we also assume that all tasks have the same re- 
source demands and run on identical slices of machines 
called slots, and that each framework runs a single job. 


4.2 Homogeneous Tasks 


We consider a cluster with n slots and a framework, f, 
that is entitled to k slots. For the purpose of this analysis, 
we consider two distributions of the task durations: 
constant (i.e., all tasks have the same length) and exponential. 
Let the mean task duration be T, and assume that 
framework f runs a job which requires βkT total computation 
time. That is, when the framework has k slots, 
it takes its job βT time to finish. 


²In workloads where the mandatory resource demands of the active 
frameworks can exceed the capacity of the cluster, the allocation 
module needs to implement admission control. 

Table 2: Ramp-up time, job completion time and utilization for both elastic and rigid frameworks, and for both constant and 
exponential task duration distributions. The framework starts with no slots. k is the number of slots the framework is entitled to under 
the scheduling policy, and βT represents the time it takes a job to complete assuming the framework gets all k slots at once. 


Table 2 summarizes the job completion times and sys- 
tem utilization for the two types of frameworks and the 
two types of task length distributions. As expected, elas- 
tic frameworks with constant task durations perform the 
best, while rigid frameworks with exponential task dura- 
tion perform the worst. Due to lack of space, we present 
only the results here and include derivations in [23]. 


Framework ramp-up time: If task durations are constant,
it will take framework f at most T time to acquire
k slots. This is simply because during a T interval, every
slot will become available, which will enable Mesos to
offer the framework all k of its preferred slots. If the
duration distribution is exponential, the expected ramp-up
time can be as high as T ln k [23].
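
One way to see where a figure like T ln k can come from is sketched below. This is an illustration under a simplifying assumption of our own, not the derivation in [23]: the framework waits for k particular slots, each occupied by a task whose remaining time is exponentially distributed with mean T. The wait is then the maximum of k such variables, whose expectation is T·H_k, where H_k is the k-th harmonic number, and H_k grows like ln k. The function name is hypothetical.

```python
import math

def expected_rampup_exponential(k: int, mean_task: float) -> float:
    """Expected maximum of k i.i.d. exponential variables with mean
    `mean_task`, which equals mean_task * H_k (H_k = k-th harmonic number).
    Assumption: the framework waits for k specific busy slots."""
    harmonic = sum(1.0 / i for i in range(1, k + 1))
    return mean_task * harmonic

T = 60.0  # illustrative mean task duration in seconds
for k in (10, 100, 1000):
    exact = expected_rampup_exponential(k, T)
    approx = T * math.log(k)
    print(f"k={k}: T*H_k={exact:.1f}s, T*ln(k)={approx:.1f}s")
```

The gap between T·H_k and T ln k stays below T for every k, so the logarithmic term dominates as k grows.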


Job completion time: The expected completion time³
of an elastic job is at most (1 + β)T, which is within T
(i.e., the mean task duration) of the completion time of
the job when it gets all its slots instantaneously. Rigid
jobs achieve similar completion times for constant task
durations, but exhibit much higher completion times for
exponential task durations, i.e., (ln k + β)T. This is simply
because it takes a framework T ln k time on average
to acquire all its slots and be able to start its job.
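
As a worked example of these bounds, the short sketch below plugs illustrative numbers (k = 100 slots, β = 10, T = 60 s, all chosen here for illustration) into the expressions quoted above. The function names are hypothetical.

```python
import math

def ideal_completion(beta: float, T: float) -> float:
    """Completion time if all k slots are available at once: beta * T."""
    return beta * T

def elastic_completion_bound(beta: float, T: float) -> float:
    """Upper bound quoted in the text for an elastic job: (1 + beta) * T."""
    return (1 + beta) * T

def rigid_exponential_completion(beta: float, T: float, k: int) -> float:
    """Expression quoted in the text for a rigid job with exponential
    task durations: (ln k + beta) * T."""
    return (math.log(k) + beta) * T

beta, T, k = 10, 60.0, 100
print(ideal_completion(beta, T))                 # 600.0
print(elastic_completion_bound(beta, T))         # 660.0
print(round(rigid_exponential_completion(beta, T, k), 1))
```

With these numbers the elastic job is at most one task length (60 s) behind the ideal, while the rigid job pays roughly 4.6 task lengths of ramp-up.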


System utilization: Elastic jobs fully utilize their al- 
located slots, because they can use every slot as soon 
as they get it. As a result, assuming infinite demand, a 
system running only elastic jobs is fully utilized. Rigid 
frameworks achieve slightly worse utilizations, as their 
jobs cannot start before they get their full allocations, and 
thus they waste the resources held while ramping up. 


4.3 Placement Preferences 


So far, we have assumed that frameworks have no slot 
preferences. In practice, different frameworks prefer dif- 
ferent nodes and their preferences may change over time. 
In this section, we consider the case where frameworks 
have different preferred slots. 

The natural question is how well Mesos will work 
compared to a central scheduler that has full information 


³When computing job completion time we assume that the last tasks
of the job running on the framework’s k slots finish at the same time.
about framework preferences. We consider two cases: 
(a) there exists a system configuration in which each 
framework gets all its preferred slots and achieves its full 
allocation, and (b) there is no such configuration, i.e., the 
demand for some preferred slots exceeds the supply. 


In the first case, it is easy to see that, irrespective of the
initial configuration, the system will converge to the state
where each framework allocates its preferred slots after
at most one T interval. This is simply because during a
T interval all slots become available, and as a result each
framework will be offered its preferred slots.


In the second case, there is no configuration in which
all frameworks can satisfy their preferences. The key
question in this case is how one should allocate the
preferred slots across the frameworks demanding them. In
particular, assume there are p slots preferred by m frameworks,
where framework i requests r_i such slots, and
$\sum_{i=1}^{m} r_i > p$. While many allocation policies are possible,
here we consider a weighted fair allocation policy
where the weight associated with framework i is its
intended total allocation, s_i. In other words, assuming that
each framework has enough demand, we aim to allocate
$p \cdot s_i / \sum_{i=1}^{m} s_i$ preferred slots to framework i.


The challenge in Mesos is that the scheduler does 
not know the preferences of each framework. Fortu- 
nately, it turns out that there is an easy way to achieve 
the weighted allocation of the preferred slots described 
above: simply perform lottery scheduling [36], offer- 
ing slots to frameworks with probabilities proportional to 
their intended allocations. In particular, when a slot becomes
available, Mesos can offer that slot to framework i
with probability $s_i / \sum_{i=1}^{n} s_i$, where n is the total number
of frameworks in the system. Furthermore, because
each framework i receives on average s_i slots every T
time units, the results for ramp-up times and completion
times in Section 4.2 still hold.
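
The lottery step described above can be sketched in a few lines. This is a minimal illustration, not Mesos code: the framework names, the `shares` dictionary, and the function `pick_framework` are all hypothetical, and the intended allocations s_i are assumed to be known to the allocator.

```python
import random
from collections import Counter

def pick_framework(intended_allocations, rng):
    """Choose the framework that receives the next free slot, with
    probability s_i / sum(s) (lottery scheduling over intended allocations)."""
    frameworks = list(intended_allocations)
    weights = [intended_allocations[f] for f in frameworks]
    return rng.choices(frameworks, weights=weights, k=1)[0]

rng = random.Random(0)                    # fixed seed for repeatability
shares = {"A": 10, "B": 30, "C": 60}      # hypothetical s_i values
offers = Counter(pick_framework(shares, rng) for _ in range(10_000))
for name in shares:
    print(name, offers[name] / 10_000)    # fractions near 0.1, 0.3, 0.6
```

Because each draw is independent, no framework needs to reveal which slots it prefers; over many offers each one simply sees its weighted share of every slot, preferred or not.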


4.4 Heterogeneous Tasks 


So far we have assumed that frameworks have homo- 
geneous task duration distributions, i.e., that all frame- 
works have the same task duration distribution. In this 
section, we discuss frameworks with heterogeneous task 
duration distributions. In particular, we consider a workload
where tasks are either short or long, where the
mean duration of the long tasks is significantly longer
than the mean of the short tasks. Such heterogeneous
workloads can hurt frameworks with short tasks. In the 
worst case, all nodes required by a short job might be 
filled with long tasks, so the job may need to wait a long 
time (relative to its execution time) to acquire resources. 
We note first that random task assignment can work
well if the fraction φ of long tasks is not very close to 1
and if each node supports multiple slots. For example,
in a cluster with S slots per node, the probability that a
node is filled with long tasks will be φ^S. When S is large
(e.g., in the case of multicore machines), this probability
is small even with φ > 0.5. If S = 8 and φ = 0.5, for example,
the probability that a node is filled with long tasks
is 0.4%. Thus, a framework with short tasks can still acquire
many preferred slots in a short period of time. In
addition, the more slots a framework is able to use, the 
likelier it is that at least k of them are running short tasks. 
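
The arithmetic behind the 0.4% figure can be checked directly. The sketch below assumes, as the argument above implicitly does, that each slot holds a long task independently with probability φ; the function name is hypothetical.

```python
def prob_node_all_long(phi: float, slots_per_node: int) -> float:
    """Probability that every slot on a node runs a long task, assuming
    each slot independently holds a long task with probability phi."""
    return phi ** slots_per_node

for slots in (1, 2, 4, 8, 16):
    p = prob_node_all_long(0.5, slots)
    print(f"S={slots}: {100 * p:.3f}%")
```

For S = 8 and φ = 0.5 this gives 0.5^8 = 0.00390625, i.e., about 0.4%.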
To further alleviate the impact of long tasks, Mesos 
can be extended slightly to allow allocation policies to 
reserve some resources on each node for short tasks. In 
particular, we can associate a maximum task duration 
with some of the resources on each node, after which 
tasks running on those resources are killed. These time 
limits can be exposed to the frameworks in resource of- 
fers, allowing them to choose whether to use these re- 
sources. This scheme is similar to the common policy of 
having a separate queue for short jobs in HPC clusters. 


4.5 Framework Incentives 


Mesos implements a decentralized scheduling model, 
where each framework decides which offers to accept. 
As with any decentralized system, it is important to un- 
derstand the incentives of entities in the system. In this 
section, we discuss the incentives of frameworks (and 
their users) to improve the response times of their jobs. 


Short tasks: A framework is incentivized to use short 
tasks for two reasons. First, it will be able to allocate any 
resources reserved for short slots. Second, using small 
tasks minimizes the wasted work if the framework loses 
a task, either due to revocation or simply due to failures. 


Scale elastically: The ability of a framework to use re- 
sources as soon as it acquires them—instead of waiting 
to reach a given minimum allocation—would allow the 
framework to start (and complete) its jobs earlier. In ad- 
dition, the ability to scale up and down allows a frame- 
work to grab unused resources opportunistically, as it can 
later release them with little negative impact. 


Do not accept unknown resources: Frameworks are 
incentivized not to accept resources that they cannot use 
because most allocation policies will count all the re- 
sources that a framework owns when making offers. 

We note that these incentives align well with our goal 
of improving utilization. If frameworks use short tasks,
Mesos can reallocate resources quickly between them,
reducing latency for new jobs and wasted work for revo- 
cation. If frameworks are elastic, they will opportunis- 
tically utilize all the resources they can obtain. Finally, 
if frameworks do not accept resources that they do not 
understand, they will leave them for frameworks that do. 

We also note that these properties are met by many 
current cluster computing frameworks, such as MapRe- 
duce and Dryad, simply because using short independent 
tasks simplifies load balancing and fault recovery. 


4.6 Limitations of Distributed Scheduling 


Although we have shown that distributed scheduling 
works well in a range of workloads relevant to current 
cluster environments, like any decentralized approach, it 
can perform worse than a centralized scheduler. We have 
identified three limitations of the distributed model: 


Fragmentation: When tasks have heterogeneous re- 
source demands, a distributed collection of frameworks 
may not be able to optimize bin packing as well as a cen- 
tralized scheduler. However, note that the wasted space 
due to suboptimal bin packing is bounded by the ratio be- 
tween the largest task size and the node size. Therefore, 
clusters running “larger” nodes (e.g., multicore nodes) 
and “smaller” tasks within those nodes will achieve high 
utilization even with distributed scheduling. 

There is another possible bad outcome if allocation 
modules reallocate resources in a naive manner: when 
a cluster is filled by tasks with small resource require- 
ments, a framework f with large resource requirements 
may starve, because whenever a small task finishes, f 
cannot accept the resources freed by it, but other frame- 
works can. To accommodate frameworks with large per- 
task resource requirements, allocation modules can sup- 
port a minimum offer size on each slave, and abstain from 
offering resources on the slave until this amount is free. 
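
A minimum offer size could be enforced with a filter like the one sketched below. This is a hypothetical illustration, not the Mesos allocation module API: the `Slave` record, its fields, and `offerable_slaves` are names invented here.

```python
from dataclasses import dataclass

@dataclass
class Slave:
    name: str
    free_cpus: float
    free_mem_gb: float

def offerable_slaves(slaves, min_cpus, min_mem_gb):
    """Return only the slaves whose free resources reach the minimum
    offer size; the others are withheld until more resources free up."""
    return [s for s in slaves
            if s.free_cpus >= min_cpus and s.free_mem_gb >= min_mem_gb]

cluster = [
    Slave("s1", free_cpus=1, free_mem_gb=2),    # small task just finished
    Slave("s2", free_cpus=4, free_mem_gb=15),   # fully idle node
    Slave("s3", free_cpus=4, free_mem_gb=1),    # CPU free, memory not
]
print([s.name for s in offerable_slaves(cluster, min_cpus=2, min_mem_gb=4)])
```

Withholding s1 and s3 lets their free resources accumulate until a framework with large per-task requirements can use them, at the cost of leaving those fragments idle in the meantime.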


Interdependent framework constraints: It is possi- 
ble to construct scenarios where, because of esoteric in- 
terdependencies between frameworks (e.g., certain tasks 
from two frameworks cannot be colocated), only a sin- 
gle global allocation of the cluster performs well. We 
argue such scenarios are rare in practice. In the model 
discussed in this section, where frameworks only have 
preferences over which nodes they use, we showed that 
allocations approximate those of optimal schedulers. 


Framework complexity: Using resource offers may 
make framework scheduling more complex. We argue, 
however, that this difficulty is not onerous. First, whether 
using Mesos or a centralized scheduler, frameworks need 
to know their preferences; in a centralized scheduler, 
the framework needs to express them to the scheduler, 
whereas in Mesos, it must use them to decide which of- 
fers to accept. Second, many scheduling policies for ex- 
isting frameworks are online algorithms, because frame- 
works cannot predict task times and must be able to han- 
dle failures and stragglers [18, 40, 38]. These policies 
are easy to implement over resource offers. 


5 Implementation 


We have implemented Mesos in about 10,000 lines of 
C++. The system runs on Linux, Solaris and OS X, and 
supports frameworks written in C++, Java, and Python. 

To reduce the complexity of our implementation, we 
use a C++ library called libprocess [7] that provides 
an actor-based programming model using efficient asyn- 
chronous I/O mechanisms (epoll, kqueue, etc). We 
also use ZooKeeper [4] to perform leader election. 

Mesos can use Linux containers [9] or Solaris projects 
[13] to isolate tasks. We currently isolate CPU cores and 
memory. We plan to leverage recently added support for 
network and I/O isolation in Linux [8] in the future. 

We have implemented four frameworks on top of 
Mesos. First, we have ported three existing cluster com- 
puting systems: Hadoop [2], the Torque resource sched- 
uler [33], and the MPICH2 implementation of MPI [16]. 
None of these ports required changing these frameworks’ 
APIs, so all of them can run unmodified user programs. 
In addition, we built a specialized framework for iterative 
jobs called Spark, which we discuss in Section 5.3. 


5.1 Hadoop Port 


Porting Hadoop to run on Mesos required relatively few 
modifications, because Hadoop’s fine-grained map and 
reduce tasks map cleanly to Mesos tasks. In addition, the 
Hadoop master, known as the JobTracker, and Hadoop 
slaves, known as TaskTrackers, fit naturally into the 
Mesos model as a framework scheduler and executor. 

To add support for running Hadoop on Mesos, we took 
advantage of the fact that Hadoop already has a plug- 
gable API for writing job schedulers. We wrote a Hadoop 
scheduler that connects to Mesos, launches TaskTrackers 
as its executors, and maps each Hadoop task to a Mesos 
task. When there are unlaunched tasks in Hadoop, our 
scheduler first starts Mesos tasks on the nodes of the 
cluster that it wants to use, and then sends the Hadoop 
tasks to them using Hadoop’s existing internal interfaces. 
When tasks finish, our executor notifies Mesos by listen- 
ing for task finish events using an API in the TaskTracker. 

We used delay scheduling [38] to achieve data locality 
by waiting for slots on the nodes that contain task in- 
put data. In addition, our approach allowed us to reuse 
Hadoop’s existing logic for re-scheduling of failed tasks 
and for speculative execution (straggler mitigation). 

We also needed to change how map output data is 
served to reduce tasks. Hadoop normally writes map 
output files to the local filesystem, then serves these to 
reduce tasks using an HTTP server included in the Task- 
Tracker. However, the TaskTracker within Mesos runs 
as an executor, which may be terminated if it is not run- 
ning tasks. This would make map output files unavailable 
to reduce tasks. We solved this problem by providing a 
shared file server on each node in the cluster to serve 
local files. Such a service is useful beyond Hadoop, to 
other frameworks that write data locally on each node. 
In total, our Hadoop port is 1500 lines of code. 


5.2 Torque and MPI Ports 


We have ported the Torque cluster resource manager to 
run as a framework on Mesos. The framework consists 
of a Mesos scheduler and executor, written in 360 lines 
of Python code, that launch and manage different com- 
ponents of Torque. In addition, we modified 3 lines of 
Torque source code to allow it to elastically scale up and 
down on Mesos depending on the jobs in its queue. 
After registering with the Mesos master, the frame- 
work scheduler configures and launches a Torque server 
and then periodically monitors the server’s job queue. 
While the queue is empty, the scheduler releases all tasks 
(down to an optional minimum, which we set to 0) and 
refuses all resource offers it receives from Mesos. Once 
a job gets added to Torque’s queue (using the standard 
qsub command), the scheduler begins accepting new 
resource offers. As long as there are jobs in Torque’s 
queue, the scheduler accepts offers as necessary to sat- 
isfy the constraints of as many jobs in the queue as pos- 
sible. On each node where offers are accepted, Mesos 
launches our executor, which in turn starts a Torque 
backend daemon and registers it with the Torque server. 
When enough Torque backend daemons have registered, 
the Torque server will launch the next job in its queue.
Because jobs that run on Torque (e.g., MPI) may not be
fault tolerant, Torque avoids having its tasks revoked by 
not accepting resources beyond its guaranteed allocation. 
In addition to the Torque framework, we also created 
a Mesos MPI “wrapper” framework, written in 200 lines 
of Python code, for running MPI jobs directly on Mesos. 
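
The offer-handling policy described for the Torque framework can be summarized in a small decision function. The sketch below is a hypothetical reconstruction from the prose only: it does not use the real Mesos scheduler API, and `plan_offers` and its parameters are invented for illustration.

```python
def plan_offers(queued_job_nodes, held_nodes, offered_nodes, guaranteed_nodes):
    """Split offered nodes into (accepted, refused).

    queued_job_nodes: node counts requested by the jobs in Torque's queue
    held_nodes:       number of nodes the framework already holds
    offered_nodes:    node names in the current resource offers
    guaranteed_nodes: the framework's guaranteed allocation, never exceeded
                      so that non-fault-tolerant jobs are not revoked
    """
    if not queued_job_nodes:
        # Empty queue: refuse everything.
        return [], list(offered_nodes)
    wanted = min(sum(queued_job_nodes), guaranteed_nodes) - held_nodes
    wanted = max(0, wanted)
    return list(offered_nodes[:wanted]), list(offered_nodes[wanted:])

offers = ["n1", "n2", "n3", "n4", "n5", "n6"]
print(plan_offers([], 0, offers, 24))       # queue empty: all refused
print(plan_offers([24], 20, offers, 24))    # four more nodes needed
```

Capping the request at the guaranteed allocation mirrors the text's point that Torque avoids revocation by never holding more than its guaranteed share.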


5.3 Spark Framework


Mesos enables the creation of specialized frameworks 
optimized for workloads for which more general exe- 
cution layers may not be optimal. To test the hypoth- 
esis that simple specialized frameworks provide value, 
we identified one class of jobs that were found to per- 
form poorly on Hadoop by machine learning researchers 
at our lab: iterative jobs, where a dataset is reused across 
a number of iterations. We built a specialized framework 
called Spark [39] optimized for these workloads. 

Figure 4: Data flow of a logistic regression job in (a) Dryad
vs. (b) Spark. Solid lines show data flow within the framework.
Dashed lines show reads from a distributed file system. Spark
reuses in-memory data across iterations to improve efficiency.


One example of an iterative algorithm used in machine
learning is logistic regression [22]. This algorithm
seeks to find a line that separates two sets of labeled data
points. The algorithm starts with a random line w. Then,
on each iteration, it computes the gradient of an objective
function that measures how well the line separates the 
points, and shifts w along this gradient. This gradient 
computation amounts to evaluating a function f(x, w)
over each data point x and summing the results. An 
implementation of logistic regression in Hadoop must 
run each iteration as a separate MapReduce job, because 
each iteration depends on the w computed at the previous 
one. This imposes overhead because every iteration must 
re-read the input file into memory. In Dryad, the whole 
job can be expressed as a data flow DAG as shown in Figure 4a,
but the data must still be reloaded from disk
at each iteration. Reusing the data in memory between 
iterations in Dryad would require cyclic data flow. 


Spark’s execution is shown in Figure 4b. Spark uses 
the long-lived nature of Mesos executors to cache a slice 
of the dataset in memory at each executor, and then run 
multiple iterations on this cached data. This caching is 
achieved in a fault-tolerant manner: if a node is lost, 
Spark remembers how to recompute its slice of the data. 
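
The parse-once, iterate-many pattern can be illustrated on a single machine. The sketch below is plain Python, not Spark code: the record format, the step size, and the choice to start from w = 0 (rather than a random line, to keep the run repeatable) are assumptions made for this example.

```python
import math

def parse(lines):
    """Parse 'label x1 x2 ...' records into (label, features) pairs."""
    points = []
    for line in lines:
        parts = line.split()
        points.append((float(parts[0]), [float(v) for v in parts[1:]]))
    return points

def gradient(points, w):
    """Sum f(x, w) over all points: the gradient of the logistic loss
    sum(log(1 + exp(-y * w.x))) with labels y in {-1, +1}."""
    grad = [0.0] * len(w)
    for y, x in points:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad

def train(lines, iterations, step=0.1):
    cached = parse(lines)              # parsing cost is paid only once
    w = [0.0] * len(cached[0][1])
    for _ in range(iterations):        # every iteration reuses `cached`
        g = gradient(cached, w)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

data = ["1 2.0 1.0", "1 1.5 2.0", "-1 -1.0 -2.0", "-1 -2.0 -0.5"]
print(train(data, iterations=20))
```

A MapReduce-style implementation would in effect call `parse` inside the loop, which is the per-iteration cost the cached version avoids.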


By building Spark on top of Mesos, we were able to 
keep its implementation small (about 1300 lines of code), 
yet still capable of outperforming Hadoop by 10x for 
iterative jobs. In particular, using Mesos’s API saved us 
the time to write a master daemon, slave daemon, and 
communication protocols between them for Spark. The 
main pieces we had to write were a framework scheduler 
(which uses delay scheduling for locality) and user APIs. 


6 Evaluation 


We evaluated Mesos through a series of experiments on 
the Amazon Elastic Compute Cloud (EC2). We begin 
with a macrobenchmark that evaluates how the system 
shares resources between four workloads, and go on to 
present a series of smaller experiments designed to eval- 
uate overhead, decentralized scheduling, our specialized 
framework (Spark), scalability, and failure recovery. 


Bin | Job Type    | Map Tasks | Reduce Tasks | # Jobs Run
2   | text search | 2         | NA           | 18


Table 3: Job types for each bin in our Facebook Hadoop mix.


6.1 Macrobenchmark 


To evaluate the primary goal of Mesos, which is enabling 
diverse frameworks to efficiently share a cluster, we ran a 
macrobenchmark consisting of a mix of four workloads: 


• A Hadoop instance running a mix of small and large
  jobs based on the workload at Facebook.
• A Hadoop instance running a set of large batch jobs.
• Spark running a series of machine learning jobs.
• Torque running a series of MPI jobs.


We compared a scenario where the workloads ran as 
four frameworks on a 96-node Mesos cluster using fair 
sharing to a scenario where they were each given a static 
partition of the cluster (24 nodes), and measured job re- 
sponse times and resource utilization in both cases. We 
used EC2 nodes with 4 CPU cores and 15 GB of RAM. 

We begin by describing the four workloads in more 
detail, and then present our results. 


6.1.1 Macrobenchmark Workloads 


Facebook Hadoop Mix: Our Hadoop job mix was
based on the distribution of job sizes and inter-arrival 
times at Facebook, reported in [38]. The workload con- 
sists of 100 jobs submitted at fixed times over a 25- 
minute period, with a mean inter-arrival time of 14s. 
Most of the jobs are small (1-12 tasks), but there are also 
large jobs of up to 400 tasks.⁴ The jobs themselves were
from the Hive benchmark [6], which contains four types 
of queries: text search, a simple selection, an aggrega- 
tion, and a join that gets translated into multiple MapRe- 
duce steps. We grouped the jobs into eight bins of job 
type and size (listed in Table 3) so that we could com- 
pare performance in each bin. We also set the framework 
scheduler to perform fair sharing between its jobs, as this 
policy is used at Facebook. 


Large Hadoop Mix: To emulate batch workloads that
need to run continuously, such as web crawling, we had
a second instance of Hadoop run a series of IO-intensive
2400-task text search jobs. A script launched ten of these 
jobs, submitting each one after the previous one finished. 


⁴We scaled down the largest jobs in [38] to have the workload fit a
quarter of our cluster size. 

(a) Facebook Hadoop Mix (b) Large Hadoop Mix

Figure 5: Comparison of cluster shares (fraction of CPUs) over time for each of the frameworks in the Mesos and static partitioning
macrobenchmark scenarios. On Mesos, frameworks can scale up when their demand is high and that of other frameworks is low, and
thus finish jobs faster. Note that the plots’ time axes are different (e.g., the large Hadoop mix takes 3200s with static partitioning).


Spark: We ran five instances of an iterative machine
learning job on Spark. These were launched by a script
that waited 2 minutes after each job ended to submit
the next. The job we used was alternating least squares
(ALS), a collaborative filtering algorithm [42]. This job
is CPU-intensive but also benefits from caching its input
data on each node, and needs to broadcast updated
parameters to all nodes running its tasks on each iteration.


Torque / MPI: Our Torque framework ran eight instances
of the tachyon raytracing job [35] that is part of
the SPEC MPI2007 benchmark. Six of the jobs ran small
problem sizes and two ran large ones. Both types used 24
parallel tasks. We submitted these jobs at fixed times to
both clusters. The tachyon job is CPU-intensive.


Figure 6: Framework shares on Mesos during the macrobenchmark.
By pooling resources, Mesos lets each workload scale
up to fill gaps in the demand of others. In addition, fine-grained
sharing allows resources to be reallocated in tens of seconds.


6.1.2 Macrobenchmark Results


A successful result for Mesos would show two things:
that Mesos achieves higher utilization than static partitioning,
and that jobs finish at least as fast in the shared
cluster as they do in their static partition, and possibly
faster due to gaps in the demand of other frameworks.
Our results show both effects, as detailed below.

We show the fraction of CPU cores allocated to each
framework by Mesos over time in Figure 6. We see that
Mesos enables each framework to scale up during periods
when other frameworks have low demands, and thus
keeps cluster nodes busier. For example, at time 350,
when both Spark and the Facebook Hadoop framework
have no running jobs and Torque is using 1/8 of the cluster,
the large-job Hadoop framework scales up to 7/8 of
the cluster. In addition, we see that resources are reallocated
rapidly (e.g., when a Facebook Hadoop job starts
around time 360) due to the fine-grained nature of tasks.
Finally, higher allocation of nodes also translates into increased
CPU and memory utilization (by 10% for CPU
and 17% for memory), as shown in Figure 7.


Figure 7: Average CPU and memory utilization over time
across all nodes in the Mesos cluster vs. static partitioning.


A second question is how much better jobs perform
under Mesos than when using a statically partitioned
cluster. We present this data in two ways. First, Figure 5
compares the resource allocation over time of
each framework in the shared and statically partitioned
clusters. Shaded areas show the allocation in the statically
partitioned cluster, while solid lines show the
share on Mesos. We see that the fine-grained frameworks
(Hadoop and Spark) take advantage of Mesos to
scale up beyond 1/4 of the cluster when global demand
allows this, and consequently finish bursts of submitted
jobs faster in Mesos. At the same time, Torque
achieves roughly similar allocations and job durations
under Mesos (with some differences explained later).


Table 4: Aggregate performance of each framework in the macrobenchmark
(sum of running times of all the jobs in the framework).
The speedup column shows the relative gain on Mesos.


Second, Tables 4 and 5 show a breakdown of job performance
for each framework. In Table 4, we compare
the aggregate performance of each framework, defined
as the sum of job running times, in the static partitioning
and Mesos scenarios. We see the Hadoop and Spark jobs
as a whole are finishing faster on Mesos, while Torque is 
slightly slower. The framework that gains the most is the 
large-job Hadoop mix, which almost always has tasks to 
run and fills in the gaps in demand of the other frame- 
works; this framework performs 2x better on Mesos. 
Table 5 breaks down the results further by job type. 
We observe two notable trends. First, in the Facebook 
Hadoop mix, the smaller jobs perform worse on Mesos. 
This is due to an interaction between the fair sharing per- 
formed by Hadoop (among its jobs) and the fair sharing 
in Mesos (among frameworks): During periods of time 
when Hadoop has more than 1/4 of the cluster, if any jobs 
are submitted to the other frameworks, there is a delay 
before Hadoop gets a new resource offer (because any 
freed up resources go to the framework farthest below its 
share), so any small job submitted during this time is de- 
layed for a long time relative to its length. In contrast, 
when running alone, Hadoop can assign resources to the 
new job as soon as any of its tasks finishes. This problem
with hierarchical fair sharing is also seen in networks
[34], and could be mitigated by running the small jobs on 
a separate framework or using a different allocation pol- 
icy (e.g., using lottery scheduling instead of offering all 
freed resources to the framework with the lowest share). 


Table 5: Performance of each job type in the macrobenchmark.
Bins for the Facebook Hadoop mix are in parentheses.


Figure 8: Data locality and average job durations for 16
Hadoop instances running on a 93-node cluster using static
partitioning, Mesos, or Mesos with delay scheduling.


Lastly, Torque is the only framework that performed
worse, on average, on Mesos. The large tachyon jobs
took on average 2 minutes longer, while the small ones
took 20s longer. Some of this delay is due to Torque having
to wait to launch 24 tasks on Mesos before starting
each job, but the average time this takes is 12s. We believe
that the rest of the delay is due to stragglers (slow
nodes). In our standalone Torque run, we saw two jobs 
take about 60s longer to run than the others (Fig. 5d). We 
discovered that both of these jobs were using a node that 
performed slower on single-node benchmarks than the 
others (in fact, Linux reported 40% lower bogomips on 
it). Because tachyon hands out equal amounts of work 
to each node, it runs as slowly as the slowest node. 


6.2 Overhead 


To measure the overhead Mesos imposes when a single 
framework uses the cluster, we ran two benchmarks us- 
ing MPI and Hadoop on an EC2 cluster with 50 nodes, 
each with 2 CPU cores and 6.5 GB RAM. We used the 
High-Performance LINPACK [15] benchmark for MPI 
and a WordCount job for Hadoop, and ran each job three 
times. The MPI job took on average 50.9s without Mesos 
and 51.8s with Mesos, while the Hadoop job took 160s 
without Mesos and 166s with Mesos. In both cases, the 
overhead of using Mesos was less than 4%. 
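
The relative overheads follow directly from the reported running times; the helper below (a hypothetical name) just makes the arithmetic explicit.

```python
def overhead_pct(without_mesos_s: float, with_mesos_s: float) -> float:
    """Relative slowdown, in percent, of running with Mesos."""
    return 100.0 * (with_mesos_s - without_mesos_s) / without_mesos_s

print(f"MPI (LINPACK):      {overhead_pct(50.9, 51.8):.2f}%")
print(f"Hadoop (WordCount): {overhead_pct(160, 166):.2f}%")
```

This gives roughly 1.8% for the MPI job and 3.75% for the Hadoop job, both under the 4% stated above.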


6.3 Data Locality through Delay Scheduling


In this experiment, we evaluated how Mesos’ resource 
offer mechanism enables frameworks to control their 
tasks’ placement, and in particular, data locality. We 
ran 16 instances of Hadoop using 93 EC2 nodes, each 
with 4 CPU cores and 15 GB RAM. Each node ran a 
map-only scan job that searched a 100 GB file spread 
throughout the cluster on a shared HDFS file system and 
outputted 1% of the records. We tested four scenarios: 
giving each Hadoop instance its own 5-6 node static par- 
tition of the cluster (to emulate organizations that use 
coarse-grained cluster sharing systems), and running all 
instances on Mesos using either no delay scheduling, 1s
delay scheduling or 5s delay scheduling. 

Figure 8 shows averaged measurements from the 16 
Hadoop instances across three runs of each scenario. Us- 
ing static partitioning yields very low data locality (18%) 
because the Hadoop instances are forced to fetch data 
from nodes outside their partition. In contrast, running 
the Hadoop instances on Mesos improves data locality, 
even without delay scheduling, because each Hadoop in- 
stance has tasks on more nodes of the cluster (there are 
4 tasks per node), and can therefore access more blocks 
locally. Adding a 1-second delay brings locality above 
90%, and a 5-second delay achieves 95% locality, which 
is competitive with running one Hadoop instance alone 
on the whole cluster. As expected, job performance im- 
proves with data locality: jobs run 1.7x faster in the 5s 
delay scenario than with static partitioning. 
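
The accept-or-wait decision at the heart of delay scheduling can be sketched as follows. This is a simplified illustration of the idea in [38], not the scheduler used in the experiment; the function and parameter names are hypothetical.

```python
def launch_decision(offered_node, nodes_with_input, waited_s, max_delay_s):
    """Decide what to do with an offer for one pending task.

    Returns "local" if the offered node holds the task's input data,
    "non-local" if the task has already waited at least max_delay_s,
    and "skip" otherwise (decline and wait for a better offer)."""
    if offered_node in nodes_with_input:
        return "local"
    if waited_s >= max_delay_s:
        return "non-local"
    return "skip"

replicas = {"n3", "n7", "n9"}   # nodes assumed to hold the input block
print(launch_decision("n7", replicas, waited_s=0.0, max_delay_s=5.0))
print(launch_decision("n1", replicas, waited_s=2.0, max_delay_s=5.0))
print(launch_decision("n1", replicas, waited_s=5.0, max_delay_s=5.0))
```

Setting `max_delay_s` to 0 disables waiting, which corresponds to the "no delay scheduling" scenario, while the 1s and 5s settings trade a short wait for a higher chance of a local launch.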


6.4 Spark Framework 


We evaluated the benefit of running iterative jobs using 
the specialized Spark framework we developed on top 
of Mesos (Section 5.3) over the general-purpose Hadoop 
framework. We used a logistic regression job imple- 
mented in Hadoop by machine learning researchers in 
our lab, and wrote a second version of the job using 
Spark. We ran each version separately on 20 EC2 nodes, 
each with 4 CPU cores and 15 GB RAM. Each exper- 
iment used a 29 GB data file and varied the number of 
logistic regression iterations from 1 to 30 (see Figure 9).

With Hadoop, each iteration takes 127s on average, 
because it runs as a separate MapReduce job. In contrast, 
with Spark, the first iteration takes 174s, but subsequent 
iterations only take about 6 seconds, leading to a speedup 
of up to 10x for 30 iterations. This happens because the 
cost of reading the data from disk and parsing it is much 
higher than the cost of evaluating the gradient function 
computed by the job on each iteration. Hadoop incurs the 
read/parsing cost on each iteration, while Spark reuses 
cached blocks of parsed data and only incurs this cost 
once. The longer time for the first iteration in Spark is 
due to the use of slower text parsing routines. 
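
A simple cost model built from the per-iteration times quoted above shows how the gap widens with the number of iterations. The model is an approximation for illustration (it treats "about 6 seconds" as exactly 6 s), and the function names are hypothetical.

```python
def hadoop_total_s(iterations: int) -> int:
    """Every iteration is a separate MapReduce job of about 127 s."""
    return 127 * iterations

def spark_total_s(iterations: int) -> int:
    """First iteration loads and parses the data (174 s); later
    iterations reuse the cached data (about 6 s each)."""
    return 174 + 6 * (iterations - 1)

def speedup(iterations: int) -> float:
    return hadoop_total_s(iterations) / spark_total_s(iterations)

for n in (1, 5, 10, 30):
    print(n, hadoop_total_s(n), spark_total_s(n), round(speedup(n), 1))
```

Under this model Spark is slower for a single iteration and roughly an order of magnitude faster at 30 iterations, in line with the speedup reported above.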


Figure 9: Hadoop and Spark logistic regression running times.


6.5 Mesos Scalability


Figure 10: Mesos master’s scalability versus number of slaves.


To evaluate Mesos’ scalability, we emulated large clusters
by running up to 50,000 slave daemons on 99 Amazon
EC2 nodes, each with 8 CPU cores and 6 GB RAM.
We used one EC2 node for the master and the rest of the 
nodes to run slaves. During the experiment, each of 200 
frameworks running throughout the cluster continuously 
launches tasks, starting one task on each slave that it re- 
ceives a resource offer for. Each task sleeps for a period 
of time based on a normal distribution with a mean of 
30 seconds and standard deviation of 10s, and then ends. 
Each slave runs up to two tasks at a time. 

Once the cluster reached steady-state (i.e., the 200 
frameworks achieve their fair shares and all resources 
were allocated), we launched a test framework that runs a 
single 10 second task and measured how long this frame- 
work took to finish. This allowed us to calculate the extra 
delay incurred over 10s due to having to register with the 
master, wait for a resource offer, accept it, wait for the 
master to process the response and launch the task on a 
slave, and wait for Mesos to report the task as finished. 

We plot this extra delay in Figure 10, showing aver- 
ages of 5 runs. We observe that the overhead remains 
small (less than one second) even at 50,000 nodes. In 
particular, this overhead is much smaller than the aver- 
age task and job lengths in data center workloads (see 
Section 2). Because Mesos was also keeping the clus- 
ter fully allocated, this indicates that the master kept up 
with the load placed on it. Unfortunately, the EC2 vir- 
tualized environment limited scalability beyond 50,000 
slaves, because at 50,000 slaves the master was process- 
ing 100,000 packets per second (in+out), which has been 
shown to be the current achievable limit on EC2 [12]. 


6.6 Failure Recovery 


To evaluate recovery from master failures, we conducted 
an experiment with 200 to 4000 slave daemons on 62 
EC2 nodes with 4 cores and 15 GB RAM. We ran 200 
frameworks that each launched 20-second tasks, and two 
Mesos masters connected to a 5-node ZooKeeper quo- 
rum. We synchronized the two masters’ clocks using NTP 
and measured the mean time to recovery (MTTR) after 
killing the active master. The MTTR is the time for all of 
the slaves and frameworks to connect to the second mas- 
ter. In all cases, the MTTR was between 4 and 8 seconds, 
with 95% confidence intervals of up to 3s on either side. 


6.7 Performance Isolation 


As discussed in Section 3.4, Mesos leverages existing 
OS isolation mechanisms to provide performance isolation 
between different frameworks’ tasks running on the 
same slave. While these mechanisms are not perfect, 
a preliminary evaluation of Linux Containers [9] shows 
promising results. In particular, using Containers to isolate 
CPU usage between a MediaWiki web server (consisting 
of multiple Apache processes running PHP) and a 
“hog” application (consisting of 256 processes spinning 
in infinite loops) shows on average only a 30% increase 
in request latency for Apache versus a 550% increase 
when running without Containers. We refer the reader to 
[29] for a fuller evaluation of OS isolation mechanisms. 


7 Related Work 


HPC and Grid Schedulers. The high performance 
computing (HPC) community has long been managing 
clusters [33, 41]. However, their target environment typ- 
ically consists of specialized hardware, such as Infini- 
band and SANs, where jobs do not need to be scheduled 
local to their data. Furthermore, each job is tightly cou- 
pled, often using barriers or message passing. Thus, each 
job is monolithic, rather than composed of fine-grained 
tasks, and does not change its resource demands during 
its lifetime. For these reasons, HPC schedulers use cen- 
tralized scheduling, and require users to declare the re- 
quired resources at job submission time. Jobs are then 
given coarse-grained allocations of the cluster. Unlike 
the Mesos approach, this does not allow jobs to locally 
access data distributed across the cluster. Furthermore, 
jobs cannot grow and shrink dynamically. In contrast, 
Mesos supports fine-grained sharing at the level of tasks 
and allows frameworks to control their placement. 

Grid computing has mostly focused on the problem 
of making diverse virtual organizations share geograph- 
ically distributed and separately administered resources 
in a secure and interoperable way. Mesos could well be 
used within a virtual organization inside a larger grid. 


Public and Private Clouds. Virtual machine clouds 
such as Amazon EC2 [1] and Eucalyptus [31] share 
common goals with Mesos, such as isolating applica- 
tions while providing a low-level abstraction (VMs). 
However, they differ from Mesos in several important 
ways. First, their relatively coarse grained VM allocation 
model leads to less efficient resource utilization and data 
sharing than in Mesos. Second, these systems generally 
do not let applications specify placement needs beyond 
the size of VM they require. In contrast, Mesos allows 
frameworks to be highly selective about task placement. 


Quincy. Quincy [25] is a fair scheduler for Dryad 
that uses a centralized scheduling algorithm for Dryad’s 
DAG-based programming model. In contrast, Mesos 
provides the lower-level abstraction of resource offers to 
support multiple cluster computing frameworks. 


Condor. The Condor cluster manager uses the Class- 
Ads language [32] to match nodes to jobs. Using a re- 
source specification language is not as flexible for frame- 
works as resource offers, since not all requirements may 
be expressible. Also, porting existing frameworks, which 
have their own schedulers, to Condor would be more dif- 
ficult than porting them to Mesos, where existing sched- 
ulers fit naturally into the two-level scheduling model. 


Next-Generation Hadoop. In February 2011, Ya- 
hoo! announced a redesign for Hadoop that uses a two- 
level scheduling model, where per-application masters 
request resources from a central manager [14]. The de- 
sign aims to support non-MapReduce applications too. 
While details about the scheduling model in this system 
are currently unavailable, we believe that the new appli- 
cation masters could naturally run as Mesos frameworks. 


8 Conclusion and Future Work 


We have presented Mesos, a thin management layer that 
allows diverse cluster computing frameworks to effi- 
ciently share resources. Mesos is built around two de- 
sign elements: a fine-grained sharing model at the level 
of tasks, and a distributed scheduling mechanism called 
resource offers that delegates scheduling decisions to the 
frameworks. Together, these elements let Mesos achieve 
high utilization, respond quickly to workload changes, 
and cater to diverse frameworks while remaining scalable 
and robust. We have shown that existing frameworks 
can effectively share resources using Mesos, that Mesos 
enables the development of specialized frameworks pro- 
viding major performance gains, such as Spark, and that 
Mesos’s simple design allows the system to be fault tol- 
erant and to scale to 50,000 nodes. 

In future work, we plan to further analyze the re- 
source offer model and determine whether any exten- 
sions can improve its efficiency while retaining its flex- 
ibility. In particular, it may be possible to have frame- 
works give richer hints about offers they would like to 
receive. Nonetheless, we believe that below any hint 
system, frameworks should still have the ability to re- 
ject offers and to choose which tasks to launch on each 
resource, so that their evolution is not constrained by the 
hint language provided by the system. 

We are also currently using Mesos to manage re- 
sources on a 40-node cluster in our lab and in a test de- 
ployment at Twitter, and plan to report on lessons from 
these deployments in future work. 

Abstract: While today’s data centers are multiplexed 
across many non-cooperating applications, they lack effec- 
tive means to share their network. Relying on TCP’s con- 
gestion control, as we show from experiments in produc- 
tion data centers, opens up the network to denial of service 
attacks and performance interference. We present Seawall, 
a network bandwidth allocation scheme that divides net- 
work capacity based on an administrator-specified policy. 
Seawall computes and enforces allocations by tunneling 
traffic through congestion controlled, point to multipoint, 
edge to edge tunnels. The resulting allocations remain 
stable regardless of the number of flows, protocols, or 
destinations in the application’s traffic mix. Unlike alter- 
nate proposals, Seawall easily supports dynamic policy 
changes and scales to the number of applications and 
churn of today’s data centers. Through evaluation of a 
prototype, we show that Seawall adds little overhead and 
achieves strong performance isolation. 


1. INTRODUCTION 


Data centers are crucial to provide the large volumes of 
compute and storage resources needed by today’s Internet 
businesses including web search, content distribution and 
social networking. To achieve cost efficiencies and on- 
demand scaling, cloud data centers [5, 28] are highly- 
multiplexed shared environments, with VMs and tasks 
from multiple tenants coexisting in the same cluster. Since 
these applications come from unrelated customers, they 
are largely uncoordinated and mutually untrusting. Thus, 
the potential for network performance interference and 
denial of service attacks is high, and so performance 
predictability remains a key concern [8] for customers 
evaluating a move to cloud datacenters. 

While data centers provide many mechanisms to sched- 
ule local compute, memory, and disk resources [10, 15], 
existing mechanisms for apportioning network resources 
fall short. End host mechanisms such as TCP congestion 
control (or variants such as TFRC and DCCP) are widely 
deployed, scale to existing traffic loads, and, to a large 
extent, determine network sharing today via a notion of 
flow-based fairness. However, TCP does little to isolate 
tenants from one another: poorly-designed or malicious 
applications can consume network capacity, to the detri- 
ment of other applications, by opening more flows or us- 
ing non-compliant protocol implementations that ignore 
congestion control. Thus, while resource allocation using 
TCP is scalable and achieves high network utilization, it 
does not provide robust performance isolation. 

Switch and router mechanisms (e.g., CoS tags, 
Weighted Fair Queuing, reservations, QCN [29]) are bet- 
ter decoupled from tenant misbehavior. However, these 
features, inherited from enterprise networks and the In- 
ternet, are of limited use when applied to the demanding 
cloud data center environment, since they cannot keep up 
with the scale and the churn observed in datacenters (e.g., 
numbers of tenants, arrival rate of new VMs), can only 
obtain isolation at the cost of network utilization, or might 
require new hardware. 

For a better solution, we propose Seawall, an edge based 
mechanism that lets administrators prescribe how their 
network is shared. Seawall works irrespective of traffic 
characteristics such as the number of flows, protocols or 
participants. Seawall provides a simple abstraction: given 
a network weight for each local entity that serves as a traf- 
fic source (VM, process, etc.), Seawall ensures that along 
all network links, the share of bandwidth obtained by the 
entity is proportional to its weight. To achieve efficiency, 
Seawall is work-conserving, proportionally redistributing 
unused shares to currently active sources. 

Beyond simply improving security by mitigating DoS 
attacks from malicious tenants and generalizing exist- 
ing use-what-you-pay-for provisioning models, per-entity 
weights also enable better control over infrastructure ser- 
vices. Data centers often mix latency- and throughput- 
sensitive tasks with background infrastructure services. 
For instance, customer-generated web traffic contends 
with the demands of VM deployment and migration tasks. 
Per-entity weights obviate the need to hand-craft every 
individual service. 

Further, per-entity weights also enable better control 
over application-level goals. Network allocation deci- 
sions can have significant impact on end-to-end metrics 
such as completion time or throughput. For example, in 
a map-reduce cluster, a reduce task with a high fan-in 
can open up many more flows than map tasks sharing the 
same bottleneck. Flow-based fairness prioritizes high fan- 
in reduce tasks over other tasks, resulting in imbalanced 
progress that leaves CPU resources idle and degrades clus- 
ter throughput. By contrast, Seawall decouples network 
allocation from communications patterns. 

Seawall achieves scalable resource allocation by reduc- 
ing the network sharing problem to an instance of dis- 
tributed congestion control. The ubiquity of TCP shows 
that such algorithms can scale to large numbers of participants, 
adapt quickly to change, and can be implemented 
strictly at the edge. Though Seawall borrows from TCP, 
Seawall’s architecture and control loop ensure robustness 
against tenant misbehavior. Seawall uses a shim layer at 
the sender that makes policy compliance mandatory by 
forcing all traffic into congestion-controlled tunnels. To 
prevent tenants from bypassing Seawall, the shim runs in 
the virtualization or platform network stack, where it is 
well-isolated from tenant code. 

Simply enforcing a separate TCP-like tunnel to every 
destination would permit each source to achieve higher 
rate by communicating with more destinations. Since this 
does not achieve the desired policy based on per-entity 
weights, Seawall instead uses a novel control loop that 
combines feedback from multiple destinations. 

Overall, we make three contributions. First, we iden- 
tify problems and missed opportunities caused by poor 
network resource allocation. Second, we explore at length 
the tradeoffs in building network allocation mechanisms 
for cloud data centers. Finally, we design and implement 
an architecture and control loop that are robust against ma- 
licious, selfish, or buggy tenant behavior. We have built 
a prototype of Seawall as a Windows NDIS filter. From 
experiments in a large server cluster, we show that Sea- 
wall achieves proportional sharing of the network while 
remaining agnostic to tenant protocols and traffic patterns 
and protects against UDP- and TCP-based DoS attacks. 
Seawall provides these benefits while achieving line rate 
with low CPU overhead. 


2. PROBLEMS WITH NETWORK SHARING IN DATACENTERS 


To understand the problems with existing network al- 
location schemes, we examine two types of clusters that 
consist of several thousands of servers and are used in 
production. The first type is that of public infrastructure 
cloud services that rent virtual machines along with other 
shared services such as storage and load balancers. In 
these datacenters, clients can submit arbitrary VM im- 
ages and choose which applications to run, who to talk to, 
how much traffic to send, when to send that traffic, and 
what protocols to use to exchange that traffic (TCP, UDP, 
# of flows). The second type is that of platform cloud 
services that support map-reduce workloads. Consider 
a map-reduce cluster that supports a search engine. It is 
used to analyze logs and improve query and advertisement 
relevance. Though this cluster is shared across many users 
and business groups, the execution platform (i.e., the job 
compiler and runtime) is proprietary code controlled by 
the datacenter provider. 

Through case studies on these datacenters we observe 
how the network is shared today, the problems that arise 
from such sharing and the requirements for an improved 
sharing mechanism. 
In all datacenters, the servers have multiple cores, multiple 
disks, and tens of GBs of RAM. The network is a 
tree-like topology [26] with 20 to 40 servers in a rack and 
a small over-subscription factor on the upstream links of 
the racks. 


2.1 Performance interference in infrastructure 
cloud services 


Recent measurements demonstrate considerable varia- 
tion in network performance metrics — medium instances 
in EC2 experience throughput that can vary by 66% [25, 
43]. We conjecture, based on anecdotal evidence, that a 
primary reason for the variation is the inability to control 
the network traffic share of a VM. 

Unlike CPU and memory, network usage is harder to 
control because it is a distributed resource. For exam- 
ple, consider the straw man where each VM’s network 
share is statically limited to a portion of the host’s NIC 
rate (the equivalent of assigning the VM a fixed number 
of cores or a static memory size). A tenant with many 
VMs can cumulatively send enough traffic to overflow the 
receiver, some network link en route to that host, or other 
network bottlenecks. Some recent work [33] shows how 
to co-locate a trojan VM with a target VM. Using this, a 
malicious tenant can degrade the network performance of 
targeted victims. Finally, a selfish client, by using vari- 
able numbers of flows, or higher rate UDP flows, can hog 
network bandwidth. 

We note that out-of-band mechanisms to mitigate these 
problems exist. Commercial cloud providers employ a 
combination of such mechanisms. First, the provider 
can account for the network usage of tenants (and VMs) 
and quarantine or ban the misbehavers. Second, cloud 
providers might provide even less visibility into their 
clusters to make it harder for a malicious client to co- 
locate with target VMs. However, neither approach is fool- 
proof. Selfish or malicious traffic can mimic legitimate 
traffic, making it hard to distinguish. Further, obfuscation 
schemes may not stop a determined adversary. 

Our position, instead, is to get at the root of the problem. 
The reason existing solutions fail is that they primarily 
rely on TCP flows. But VMs are free to choose their 
number of flows, congestion control variant, and even 
whether they respond to congestion, allowing a small 
number of VMs to disproportionately impact the network. 
Hence, we seek alternative ways to share the network 
that are independent of the clients’ traffic matrices and 
implementations. 


2.2 Poorly-performing schedules in Cosmos 


We shift focus to Cosmos [9], a dedicated internal 
cluster that supports map-reduce workloads. We obtained 
detailed logs over several days from a production cluster 
with thousands of servers that supports the Bing search 
engine. The logs document the begin and end times of 
jobs, tasks and flows in this cluster. 

Figure 1: Distribution of the number of flows per task in Cosmos (x-axis: # of flows per task). 

Figure 2: Variation in number of flows per task is due to the role of the task. 

Performance interference happens here as well. In- 
stances of high network load are common. A few enti- 
ties (jobs, background services) contribute a substantial 
share of the traffic [22]. Tasks that move data over con- 
gested links suffer collateral damage — they are more 
likely to experience failures and become stragglers at the 
job level [6, 22]. 

Uniquely, however, we find that the de facto way of 
sharing the network leads to poor schedules. This is 
because schedulers for map-reduce platforms [27, 45] 
explicitly allocate local resources such as compute slots 
and memory. But, the underlying network primitives pre- 
vent them from exerting control over how tasks share the 
network. Map-reduce tasks naturally vary in the number 
of flows and the volume of data moved — a map task may 
have to read from just one location but a reduce task has 
to read data from all the map tasks in the preceding stage. 
Figure 1 shows that of the tasks that read data across 
racks, 20% of the tasks use just one flow, another 70% of 
the tasks vary between 30 and 100 flows, and 2% of the 
tasks use more than 150 flows. Figure 2 shows that this 
variation is due to the role of the task. 

Because reduce tasks use a large number of flows, they 
starve other tasks that share the same paths. Even if the 
scheduler is tuned to assign a large number of compute 
slots for map tasks, just a few reduce tasks will cause 
these map tasks to be bottlenecked on the network. Thus, 
the compute slots held by the maps make little progress. 

In principle, such unexpectedly idle slots could be put 
to better use on compute-heavy tasks or tasks that use 
less loaded network paths. However, current map-reduce 
schedulers do not support such load redistribution. 

A simple example illustrates this problem. Figure 3 
examines different ways of scheduling six tasks, five maps 
that each want to move 1 unit of data across a link of unit 
capacity and one reduce that wants to move 10 units of 
data from ten different locations over the same link. If 
the reduce uses 10 flows and each map uses 1 flow, as 
they do today, each of the 15 flows obtains 1/15th of the link 
bandwidth and all six tasks finish at t = 15 (the schedule 
shown in black). The total activity period, since each task 
uses local resources that no one else can use during the 
period it is active, is 6 * 15 = 90. 

Figure 3: Poor sharing of the network leads to poor performance and wasted resources. 

If each task gets an even share of the link, it is easy to 
see that the map tasks will finish at t = 6 and the reduce 
task finishes at t = 15. In this case, the total activity 
period is 5 * 6 + 1 * 15 = 45, or a 50% reduction in 
resource usage (the green solid line in Fig. 3). These 
spare resources can be used for other jobs or subsequent 
tasks within the same job. 

The preceding example shows how the inherent varia- 
tion in the way applications use the network can lead to 
poor schedules in the absence of control over how the net- 
work is shared. Our goal is to design ways of sharing the 
network that are efficient (no link goes idle if pent-up de- 
mand exists) and are independent of the traffic mix (UDP, 
#’s of TCP flows). 

We note that prescribing the optimal bandwidth shares 
is a non-goal for this paper. In fact, evenly allocating 
bandwidth across tasks is not optimal for some metrics. 
If the provider has perfect knowledge about demands, 
scheduling the shortest remaining transfer first will mini- 
mize the activity period [18]. Going back to the example, 
this means that the five map tasks get exclusive access 
to the link and finish one after the other resulting in an 
activity period of 30 (the red dashed line in Fig. 3). How- 
ever, this scheme has the side-effect of starving all the 
waiting transfers and requires perfect knowledge about 
client demands, which is hard to obtain in practice. 
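
The three schedules in this example can be reproduced with a small fluid model. The sketch below is illustrative only: it assumes an idealized link where active tasks split capacity exactly in proportion to a weight (the flow count for today's per-flow sharing, or 1 for an even per-task split), and that every task is active from t = 0 until it finishes. The function names are hypothetical and do not come from the paper.

```python
from fractions import Fraction


def finish_times(demands, weights, capacity=1):
    """Fluid model: active tasks split `capacity` in proportion to weight."""
    remaining = [Fraction(d) for d in demands]
    finish = [None] * len(demands)
    active = set(range(len(demands)))
    now = Fraction(0)
    while active:
        total_weight = sum(weights[i] for i in active)
        rates = {i: Fraction(capacity) * weights[i] / total_weight
                 for i in active}
        # Advance time to the next task completion.
        dt = min(remaining[i] / rates[i] for i in active)
        now += dt
        for i in list(active):
            remaining[i] -= rates[i] * dt
            if remaining[i] == 0:
                finish[i] = now
                active.remove(i)
    return finish


def shortest_first_finish_times(demands, capacity=1):
    """Serve transfers one at a time, smallest demand first."""
    finish = [None] * len(demands)
    now = Fraction(0)
    for i in sorted(range(len(demands)), key=lambda i: demands[i]):
        now += Fraction(demands[i]) / capacity
        finish[i] = now
    return finish


if __name__ == "__main__":
    demands = [1] * 5 + [10]       # five maps, one reduce
    flow_counts = [1] * 5 + [10]   # one flow per map, ten for the reduce
    per_flow = finish_times(demands, flow_counts)
    per_task = finish_times(demands, [1] * 6)
    shortest = shortest_first_finish_times(demands)
    print(sum(per_flow), sum(per_task), sum(shortest))
```

Summing the finish times gives the activity periods discussed above: 90 for per-flow sharing, 45 for an even per-task split, and 30 for shortest-transfer-first.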


2.3 Magnitude of scale and churn 

Figure 4: Scale and churn seen in the observed datacenter. 


We attempt to understand the nature of the sharing prob- 
lem in production datacenters. We find that the number 
of classes to share bandwidth among is large and varies 
frequently. Figure 4(a) shows the distribution of the number 
of concurrent entities that share the examined Cosmos 
cluster. Note that the x-axis is in log scale. We see that at 
median, there are 500 stages (e.g., map, reduce, join), 10^4 
tasks and 10^5 flows in the cluster. The number of traffic 
classes required is at least two orders of magnitude larger 
than is feasible with current CoS tags or the number of 
WFQ/DRR queues that switches can handle per port. 

Figure 4(b) shows the distribution of the number of 
new arrivals in the observed cluster. Note that the x-axis 
is again in log scale. At median, 10 new stages, 10^4 new 
tasks and 5 × 10^4 new flows arrive in the cluster every 
minute. Anecdotal analysis of EC2, based on decoding 
the instance identifiers, concluded that O(10^4) new VM 
instances are requested each day [34]. Updating VLANs 
or re-configuring switches whenever a VM arrives is several 
orders of magnitude more frequent than is achievable 
in today’s enterprise networks. 

Each of the observed data centers is large, with up to 
tens of thousands of servers, thousands of ToR switches, 
several tens of aggregation switches, load balancers, etc. 
Predicting traffic is easier in platform datacenters (e.g., 
Cosmos) wherein high level descriptions of the jobs are 
available. However, the scale and churn numbers indicate 
that obtaining up-to-date information (e.g., within a 
minute) may be a practical challenge. In cloud datacenters 
(e.g., EC2) traffic is even harder to predict because 
customers’ traffic is unconstrained and privacy concerns 
limit instrumentation. 


3. REQUIREMENTS 


From the above case studies and from interviews with 
operators of production clusters, we identify these require- 
ments for sharing the datacenter network. 

An ideal network sharing solution for datacenters has 
to scale, keep up with churn and retain high network 
utilization. It must do so without assuming well-behaved 
or TCP-compliant tenants. Since changes to the NICs and 
switches are expensive, take some time to standardize and 
deploy, and are hard to customize once deployed, edge- 
and software-based solutions are preferable. 

• Traffic Agnostic, Simple Service Interface: Tenants 
cannot be expected to know or curtail the nature of their 
traffic. It is good business sense to accommodate diverse 
applications. While it is tempting to design sharing 
mechanisms that require tenants to specify a traffic 
matrix, i.e., the pattern and volume of traffic between 
the tenant’s VMs, we find this to be an unrealistic burden. 
Changes in demands from the tenant’s customers 
and dynamics of their workload (e.g., map-reduce) will 
change the requirements. Hence, it is preferable to 
keep a thin service interface, e.g., have tenants choose 
a class of network service. 

• Require no changes to network topology or hardware: 
Recently, many data center network topologies 
have been proposed [2, 3, 16, 21]. Cost benefit tradeoffs 
indicate that the choice of topology depends on the 
intended usage. For example, EC2 recently introduced 
a full bisection bandwidth network for high perfor- 
mance computing (HPC); less expensive EC2 service 
levels continue to use the over-subscribed tree topol- 
ogy. To be widely applicable, mechanisms to share the 
network should be agnostic to network topology. 

• Scale to large numbers of tenants and high churn: 
To have practical benefit, any network sharing mecha- 
nism would need to scale to support the large workloads 
seen in real datacenters. 

• Enforce sharing without sacrificing efficiency: Statically 
apportioning fractions of the bandwidth improves 
sharing at the cost of efficiency and can result in band- 
width fragmentation that makes it harder to accommo- 
date new tenants. At the same time, a tenant with pent 
up demand can use no more than its reservation even if 
the network is idle. 


To meet these requirements, Seawall relies on congestion- 
controlled tunnels implemented in the host but requires 
no per-flow state within switches. In this way, Seawall is 
independent of the physical data center network. Seawall 
does benefit from measurements at switches, if they are 
available. Seawall scales to large numbers of tenants and 
handles high churn, because provisioning new VMs or 
tasks is entirely transparent to the physical network. As 
tenants, VMs, or tasks come and go, there is no change 
to the physical network through signaling or configura- 
tion. Seawall’s design exploits the homogeneity of the 
data center environment, where end host software is easy 
to change and topology is predictable. These properties 
enable Seawall to use a system architecture and algorithms 
that are impractical on the Internet yet well-suited for data 
centers. 


4. SEAWALL DESIGN 


Seawall exposes the following abstraction. A network 
weight is associated with each entity that is sharing the 
network. The entity can be any traffic source that is con- 
fined to a single node, such as a VM, process, or col- 
lection of port numbers, but not a tenant or set of VMs. 
On each link in the network, Seawall provides the entity 
with a bandwidth share that is proportional to its 
weight; i.e., an entity $k$ with weight $w_k$ sending traffic 
over link $l$ obtains this share of the total capacity of that 
link: 

$$\mathrm{Share}(k, l) = \frac{w_k}{\sum_{i \in \mathrm{Active}(l)} w_i}$$ 

Here, Active(l) is the set of entities actively sending traffic 
across $l$. The allocation is end-to-end, i.e., traffic to a 
destination will be limited 
by the smallest Share(k, l) over links on the path to 
that destination. The allocation is also work-conserving: 
bandwidth that is unused because the entity needs less 
than its share or because its traffic is bottlenecked else- 
where is re-apportioned among other users of the link in 
proportion to their weights. Here, we present a distributed 
technique that holds entities to these allocations while 
meeting our design requirements. 
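
As a minimal sketch of the abstraction just described, the per-link share and the end-to-end limit can be computed directly from the weights. The function names, the dictionary-based topology, and the assumption that all links have equal capacity are illustrative choices, not part of Seawall; the work-conserving redistribution of unused shares is deliberately not modeled.

```python
from fractions import Fraction


def link_share(weights, active, k):
    """Share(k, l): weight of k over the total weight active on link l."""
    if k not in active:
        raise ValueError("entity is not sending on this link")
    return Fraction(weights[k], sum(weights[i] for i in active))


def end_to_end_share(weights, active_by_link, path, k):
    """Smallest per-link share along the path (equal link capacities assumed)."""
    return min(link_share(weights, active_by_link[link], k) for link in path)


if __name__ == "__main__":
    weights = {"A": 1, "B": 2, "C": 1}          # hypothetical entities
    active_by_link = {
        "l1": {"A", "B"},
        "l2": {"A", "B", "C"},
    }
    print(link_share(weights, active_by_link["l1"], "A"))
    print(end_to_end_share(weights, active_by_link, ["l1", "l2"], "A"))
```

With integer weights the shares stay exact fractions, which makes it easy to confirm that the shares of all active entities on a link sum to one.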

Weights can be adjusted dynamically and allocations re- 
converge rapidly. The special case of assigning the same 
weight to all entities divides bandwidth in a max-min fair 
fashion. By specifying equal weights to VMs, a public 
cloud provider can avoid performance interference from 
misbehaving or selfish VMs (§2.1). We defer describing 
further ways to configure weights and enforcing global 
allocations, such as over a set of VMs belonging to the 
same tenant, to §4.6. 


4.1 Data path 


To achieve the desired sharing of the network, Sea- 
wall sends traffic through congestion-controlled logical 
tunnels. As shown in Figure 5, these tunnels are imple- 
mented within a shim layer that intercepts all packets 
entering and leaving the server. At the sender, each tunnel 
is associated with an allowed rate for traffic on that tunnel, 
implemented as a rate limiter. The receive end of the tun- 
nel monitors traffic and sends congestion feedback back 
to the sender. A bandwidth allocator corresponding to 
each entity uses feedback from all of the entity’s tunnels 
to adapt the allowed rate on each tunnel. The bandwidth 
allocators take the network weights as parameters, work 
independently of each other, and together ensure that the 
network allocations converge to their desired values. 

The Seawall shim layer is deployed to all servers in the 
data center by the management software that is respon- 
sible for provisioning and monitoring these servers (e.g., 
Autopilot, Azure Fabric). To ensure that only traffic 
controlled by Seawall enters the network, a provider can 
use attestation-based 802.1x authentication to disallow 
servers without the shim from connecting to the network. 

The feedback to the control loop is returned at regular 
intervals, spaced T apart. It includes both explicit control 
signals from the receivers as well as congestion feedback 
about the path. Using the former, a receiver can explicitly 
block or rate-limit unwanted traffic. Using the latter, the 
bandwidth allocators adapt allowed rate on the tunnels. To 
help the receiver prepare congestion feedback, the shim at 
the sender maintains a byte sequence number per tunnel 
(i.e., per (sending entity, destination) pair). The sender 
shim stamps outgoing packets with the corresponding 
tunnel’s current sequence number. The receiver detects 
losses in the same way as TCP, by looking for gaps in the 
received sequence number space. At the end of an interval, 
the receiver issues feedback that reports the number of 
bytes received and the percentage of bytes deemed to 
be lost (Figure 6). Optionally, if ECN is enabled along 
the network path, the feedback also relays the fraction of 
packets received with congestion marks. 
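
The receiver-side bookkeeping described above might be sketched as follows. This is an illustration under stated assumptions rather than Seawall's implementation: each packet is assumed to carry the byte offset of its first byte as the sequence number, reordering is ignored (a late packet is counted as received but an earlier gap is not retracted), and the `Feedback` and `TunnelReceiver` names and fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    bytes_received: int
    loss_percent: float   # percentage of bytes deemed lost in the period
    ecn_fraction: float   # fraction of packets carrying congestion marks


class TunnelReceiver:
    """Per-tunnel state kept by the receiving shim (illustrative only)."""

    def __init__(self, first_seq=0):
        self.next_expected = first_seq
        self._start_period()

    def _start_period(self):
        self.bytes_received = 0
        self.bytes_lost = 0
        self.packets = 0
        self.marked_packets = 0

    def on_packet(self, seq, length, ecn_marked=False):
        # A gap in the sequence space is treated as lost bytes, as in TCP.
        if seq > self.next_expected:
            self.bytes_lost += seq - self.next_expected
        self.next_expected = max(self.next_expected, seq + length)
        self.bytes_received += length
        self.packets += 1
        if ecn_marked:
            self.marked_packets += 1

    def end_of_period(self):
        """Build the feedback for this period and reset the counters."""
        total = self.bytes_received + self.bytes_lost
        loss = 100.0 * self.bytes_lost / total if total else 0.0
        ecn = self.marked_packets / self.packets if self.packets else 0.0
        feedback = Feedback(self.bytes_received, loss, ecn)
        self._start_period()
        return feedback
```

A sender-side allocator would consume one such `Feedback` per period; the exact wire format of the feedback is not specified here.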

We show efficient ways of stamping packets without 
adding a header and implementing queues and rate lim- 
iters in §5. Here, we describe the bandwidth allocator. 
Class 1: A Strawman Bandwidth Allocator: an instance of 
this class is associated with each (entity, tunnel) pair. 


4.2 Strawman 


Consider the strawman bandwidth allocator in Class 1. 
Recall that the goal of the bandwidth allocator is to con- 
trol the entity’s network allocation as per the entity’s net- 
work weight. Apart from the proportion variable, which 
we'll ignore for now, Class 1 is akin to weighted addi- 
tive increase, multiplicative decrease (AIMD). It works 
as follows: when feedback indicates loss, it multiplica-
tively decreases the allowed rate by α. Otherwise, the rate
increases by an additive constant.

This simple strawman satisfies some of our require- 
ments. By making the additive increase step size a func- 
tion of the entity’s weight, the equilibrium rate achieved 
by an entity will be proportional to its weight. Unused 
shares are allocated to tunnels that have pent up demand, 
favoring efficiency over strict reservations. Global co- 
ordination is not needed. Further, when weights change, 
rates re-converge quickly (within one sawtooth period). 
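
A minimal sketch of such a weighted AIMD loop is given below, under stated assumptions: the class and function names are hypothetical, the decrease factor alpha = 0.5 and unit step are placeholder values, and loss is modeled crudely as "the sum of rates exceeds the link capacity" with all loops receiving feedback in lockstep.

```python
class WeightedAIMD:
    """Strawman-style loop: additive increase scaled by weight,
    multiplicative decrease on loss (hypothetical sketch)."""

    def __init__(self, weight, initial_rate=1.0, alpha=0.5, step=1.0):
        self.weight = weight
        self.rate = initial_rate
        self.alpha = alpha  # fraction of the rate removed on loss
        self.step = step    # additive constant per feedback period

    def take_feedback(self, loss):
        if loss:
            self.rate -= self.rate * self.alpha
        else:
            self.rate += self.weight * self.step
        return self.rate


def simulate(weights, capacity, rounds):
    """Run one loop per entity over a single shared link."""
    loops = [WeightedAIMD(w) for w in weights]
    for _ in range(rounds):
        # Toy congestion signal: loss whenever the link is oversubscribed.
        loss = sum(loop.rate for loop in loops) > capacity
        for loop in loops:
            loop.take_feedback(loss)
    return [loop.rate for loop in loops]
```

With weights 1 and 2, the ratio of the two rates in this toy model approaches 2: multiplicative decreases preserve the ratio, while additive increases pull it toward the ratio of the weights.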

We derive the distributed control loop in Class 1 from 
TCP-Reno though any other flow-oriented protocol [4, 1, 
29, 32] can be used, so long as it can extend to provide 
weighted allocations, as in MulTCP or MPAT [11, 39]. 
Distributed control loops are sensitive to variation in RTT.
However, Seawall avoids this by using a constant feedback
period T, chosen to be larger than the largest RTT of the
intra-datacenter paths controlled by Seawall. Conserva-
tively, Seawall treats the absence of feedback within a period
of T as if feedback indicating loss had been received.

Figure 7: When entities talk to different numbers of des-
tinations, pair-wise allocation of bandwidth is not sufficient.
Reduce tasks behave like the orange entity while maps re-
semble the green. (Assume that both orange and green en-
tities have the same weight.)

Simply applying AIMD, or any other distributed con- 
trol loop, on a per-tunnel basis does not achieve the de- 
sired per-link bandwidth distribution. Suppose a tenant 
has N VMs and opens flows between every pair of VMs. 
This results in a tunnel between each pair of VMs. With one
AIMD loop per tunnel, each VM achieves O(N) times its
allocation at the bottleneck link. Large tenants can over-
whelm smaller tenants, as shown in Figure 7.

Seawall improves on this simple strawman in three ways. 
First, it has a unique technique to combine feedback from 
multiple destinations. By doing so, an entity’s share of 
the network is governed by its network weight and is in-
dependent of the number of tunnels it uses (§4.3). The
resulting policy is consistent with how cloud providers 
allocate other resources, such as compute and memory, 
to a tenant, yet is a significant departure from prior ap- 
proaches to network scheduling. Second, the sawtooth 
behavior of AIMD leads to poor convergence on paths 
with high bandwidth-delay product. To mitigate this, Sea- 
wall modifies the adaptation logic to converge quickly and 
stay at equilibrium longer (§4.4). Third, we show how to 
nest traffic with different levels of responsiveness to con- 
gestion signals (e.g., TCP vs. UDP) within Seawall (§4.5). 


4.3 Seawall’s Bandwidth Allocator 


The bandwidth allocator, associated with each entity, 
takes as input the network weight of that entity, the con- 
gestion feedback from all the receivers that the entity is 
communicating with, and generates the allowed rate on
each of the entity’s tunnels. It has two parts: a distributed 
congestion control loop that computes the entity’s cumu- 
lative share on each link and a local scheduler that divides 
that share among the various tunnels. 


Step 1: Use distributed control loops to determine 
per-link, per-entity share. The ideal feedback would be 
per-link. It would include the cumulative usage of the en- 
tity across all the tunnels on this link, the total load on the 
link, and the network weights of all the entities using that 
link. Such feedback is possible if switches implement ex-
plicit feedback (e.g., XCP, QCN) or via programmable
switch sampling (e.g., SideCar [38]). Lacking these, the
baseline Seawall relies only on existing congestion signals
such as end-to-end losses or ECN marks. These signals 
identify congested paths, rather than links. 

To approximate link-level congestion information us- 
ing path-level congestion signals, Seawall uses a heuristic 
based on the observation that a congested link causes 
losses in many tunnels using that link. The logic is de-
scribed in Class 2. One instance of this class is associated
with each entity and maintains separate per-link instances
of the distributed control loop (rc_l). Assume for now that
rc_l is implemented as per the strawman Class 1, though
we will replace it with Class 3. The sender shim stores
the feedback from each destination, and once every pe-
riod T, applies all the feedback cumulatively (lines 8-10).
The heuristic scales the impact of feedback from a given
destination in proportion to the volume of traffic sent to
that destination by the shim in the last period (lines 7, 10).

To understand how this helps, consider the example 
in Figure 7. An instance of Class 2, corresponding to
the orange entity, cumulatively applies the feedback from
all three destinations accessed via the bottleneck link to
the single distributed control loop object representing
that link. Since the proportions sum up to 1 across all
destinations, the share of the orange entity will increase
only as much as that of the green entity.

A simplification follows because the shim at the re- 
ceiver reports the fraction of bytes lost or marked. Hence, 
rather than invoking the distributed control loop once per 
destination, Class 2 computes just three numbers per link 
— the proportions of total feedback indicating loss, ECN 
marks, and neither, and invokes the distributed control 
loop once with each. 
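
The aggregation just described might look like the sketch below. It is an illustration under assumptions rather than the prototype's logic: the function name and dictionary layout are hypothetical, and the lost and marked fractions reported by a receiver are treated as disjoint.

```python
def aggregate_link_feedback(traffic, feedback, paths):
    """Collapse per-destination feedback into three numbers per link.

    traffic:  dest -> bytes sent to dest in the last period
    feedback: dest -> (fraction_lost, fraction_marked) from that receiver
    paths:    dest -> list of link ids on the path to dest
    Returns:  link -> (loss, marked, neither), each weighted by the
              proportion of the sender's traffic going to each dest.
    """
    total = sum(traffic.values())
    per_link = {}
    for dest, sent in traffic.items():
        p = sent / total if total else 0.0  # proportion of traffic to dest
        lost, marked = feedback[dest]
        for link in paths[dest]:
            l, m, n = per_link.get(link, (0.0, 0.0, 0.0))
            per_link[link] = (l + p * lost,
                              m + p * marked,
                              n + p * (1.0 - lost - marked))
    return per_link
```

For a link that carries all of the sender's traffic, the three numbers sum to 1, so the control loop for that link is invoked once per category instead of once per destination.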


 1: .Begin (weight W)
 2:   { rc_l.Begin(W) ∀ links l used by sender }    ▷ Initialize
 3: .TakeFeedback (feedback f_dest)
 4:   { store feedback }
 5: .Periodically ()
 6:   {
 7:   proportion of traffic to d, p_d = (traffic to d) / (total traffic)
 8:   for all destinations d do
 9:     for all links l ∈ PathTo(d) do
10:       rc_l.TakeFeedback(f_d, p_d)
11:     end for
12:   end for
13:   ▷ rc_l now contains per-link share for this entity
14:   ▷ n_l ← count of dest with paths through link l
15:   ▷ r_d is allowed rate to d
16:   r_d ← min over l ∈ PathTo(d) of (β · (usage share of d) + (1 − β)/n_l) · rc_l.rate
17:   }

Class 2: Seawall’s bandwidth allocator: A separate in-
stance of this class is associated with each entity. It com-
bines per-link distributed control loops (invoked in lines 2,
10) with a local scheduler (line 16).


Step 2: Convert per-link, per-entity shares to per-link,
per-tunnel shares. Next, Seawall runs a local allocator to
assign rate limits to each tunnel that respect the entity’s
per-link rate constraints. A naive approach divides each
link’s allowed rate evenly across all downstream desti-
nations. For the example in Fig. 7, this gives each of the
three destinations of the orange entity a 1/3rd share of
the bottleneck link. This wastes bandwidth if the
demands across destinations vary. For example, if the
orange entity has demands (2x, x, x) to the three desti-
nations and the bottleneck’s share for this entity is 4x,
dividing evenly causes the first destination to get no more
than 4x/3 while bandwidth goes wasted. Hence, Seawall ap-
portions link bandwidth to destinations as shown in line
16, Class 2. The intuition is to adapt the allocations to
match the demands. Seawall uses an exponential moving
average that allocates a β fraction of the link bandwidth
proportional to current usage and the rest evenly across
destinations. By default, we use β = 0.9. Revisiting the
(2x, x, x) example, note that while the first destination
uses up all of its allowed share, the other two destinations
do not, causing the first to get a larger share in the next
period. In fact, the allowed share of the first destination
converges to within 20% of its demand in four iterations.
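
To make the (2x, x, x) example concrete, the sketch below iterates one plausible reading of the local scheduler on a single link: a β fraction of the link rate is split in proportion to each destination's usage in the previous period, and the remainder is split evenly. The function names, the even starting split, and the assumption that usage equals min(allocation, demand) are ours and are not taken from the prototype.

```python
def apportion(link_rate, usage, beta=0.9):
    """Split link_rate across destinations: a beta fraction in
    proportion to last-period usage, the rest evenly."""
    n = len(usage)
    total = sum(usage)
    shares = []
    for u in usage:
        proportional = u / total if total > 0 else 1.0 / n
        shares.append((beta * proportional + (1.0 - beta) / n) * link_rate)
    return shares


def iterate(link_rate, demands, rounds, beta=0.9):
    """Repeat the allocation, assuming each destination uses
    min(allocation, demand) in every period."""
    alloc = [link_rate / len(demands)] * len(demands)  # even split to start
    history = [alloc]
    for _ in range(rounds):
        usage = [min(a, d) for a, d in zip(alloc, demands)]
        alloc = apportion(link_rate, usage, beta)
        history.append(alloc)
    return history
```

Calling iterate(4.0, [2.0, 1.0, 1.0], rounds=4) with x = 1 shows the first destination's allocation growing in every period from the even split of 4/3 toward its demand of 2, while the per-period allocations always sum to the link rate.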

Finally, Seawall converts these per-link, per-destination
rate limits to a tunnel (i.e., per-path) rate limit by com-
puting the minimum of the allowed rate on each link on
the path. Note that Class 2 converges to a lower bound 
on the per-link allowed rate. At bottleneck links, this is 
tight. At other links, such as those used by the green 
flow in Figure 7 that are not the bottleneck, Class 2 can 
under-estimate their usable rate. Only when the green 
entity uses these other links on paths that do not overlap 
with the bottleneck, will the usable rate on those links 
increase. This behavior is the best that can be done using 
just path congestion signals and is harmless since the rate 
along each tunnel, computed as the minimum along each 
link on that path, is governed by the bottleneck. 


4.4 Improving the Rate Adaptation Logic 


Weighted AIMD suffers from inefficiencies as adap- 
tation periods increase, especially for paths with high 
bandwidth-delay product [23] such as those in datacen- 
ters. Seawall uses control laws from CUBIC [32] to 
achieve faster convergence, longer dwell time at the equi- 
librium point, and higher utilization than AIMD. As with 
weighted AIMD, Seawall modifies the control laws to sup- 
port weights and to incorporate feedback from multiple 
destinations. If switches support ECN, Seawall also in- 
corporates the control laws from DCTCP [4] to further 
smooth out the sawtooth and reduce queue utilization at 
the bottleneck, resulting in reduced latency, less packet 
loss, and improved resistance against incast collapse. 

The resulting control loop is shown in Class 3; the sta-
bility follows from that of CUBIC and DCTCP. Though
we describe a rate-based variant, the equivalent window
based versions are feasible and we defer those to future
work. We elaborate on parameter choices in §4.6. Lines
14-17 cause the rate to increase along a concave curve, i.e.,
quickly initially and then slower as rate nears r_goal. After
that, lines 18-19 implement convex increase to rapidly
probe for a new rate. Line 5 maintains a smoothed es-
timate of congestion, allowing multiplicative decreases
to be modulated accordingly (line 8) so that the average
queue size at the bottleneck stays small.

 1: .Begin (weight W)
 2:   { rate r ← I, weight w ← W, c ← 0, inc ← 0 }    ▷ Init
 3: .TakeFeedback (feedback f, proportion p)
 4:   {
 5:   c ← c + γ * p * (f.bytesMarked − c)
 6:   ▷ maintain smoothed estimate of congestion
 7:   if f.bytesMarked > 0 then
 8:     r_new ← r − r * α * p * c    ▷ Smoothed mult. decrease
 9:     inc ← 0
10:     t_lastdrop ← now
11:     r_goal ← (r > r_goal) ? r : (r + r_new)/2
12:     r ← r_new
13:   else    ▷ Increase rate
14:     if r < r_goal then    ▷ Less than goal, concave increase
15:       Δt = min((now − t_lastdrop)/T_s, .9)
16:       Δr = δ * (r_goal − r) * (1 − Δt)^3
17:       r ← r + w * p * Δr
18:     else    ▷ Above goal, convex increase
19:       r ← r + p * inc
20:       inc ← inc + w * p
21:     end if
22:   end if
23:   }

Class 3: Seawall’s distributed control loop: an instance of
this class is associated with each (link, entity) pair. Note
that Class 2 invokes this loop (lines 2, 10).
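
A rough Python rendering of this control loop is sketched below, based on our reading of Class 3. Several things are assumptions: f.bytesMarked is taken to be a fraction in [0, 1], the goal rate starts at zero, time is passed in explicitly, and the parameter values (alpha, gamma, delta, t_s) are placeholders rather than recommended settings.

```python
class SeawallControlLoop:
    """Sketch of a per-(link, entity) rate control loop in the style
    of Class 3. All default parameter values are placeholders."""

    def __init__(self, weight, initial_rate=10.0,
                 alpha=0.5, gamma=0.1, delta=0.4, t_s=1.0):
        self.r = initial_rate   # allowed rate
        self.w = weight
        self.c = 0.0            # smoothed estimate of congestion
        self.inc = 0.0          # step size for the convex phase
        self.r_goal = 0.0       # assumption: no goal before first drop
        self.t_lastdrop = 0.0
        self.alpha, self.gamma = alpha, gamma
        self.delta, self.t_s = delta, t_s

    def take_feedback(self, marked, p, now):
        """marked: fraction of bytes lost/marked; p: traffic proportion."""
        self.c += self.gamma * p * (marked - self.c)
        if marked > 0:
            # Multiplicative decrease, modulated by smoothed congestion.
            r_new = self.r - self.r * self.alpha * p * self.c
            self.inc = 0.0
            self.t_lastdrop = now
            if self.r <= self.r_goal:
                self.r_goal = (self.r + r_new) / 2.0
            else:
                self.r_goal = self.r
            self.r = r_new
        elif self.r < self.r_goal:
            # Below goal: concave increase, fast at first, then slower.
            dt = min((now - self.t_lastdrop) / self.t_s, 0.9)
            dr = self.delta * (self.r_goal - self.r) * (1.0 - dt) ** 3
            self.r += self.w * p * dr
        else:
            # At or above goal: convex increase to probe for a new rate.
            self.r += p * self.inc
            self.inc += self.w * p
        return self.r
```

In this sketch, a loop that has seen no congestion probes upward with growing steps; after a congestion signal it drops by a fraction scaled by the smoothed congestion estimate and then climbs back toward the remembered goal rate without overshooting it in a single step.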


4.5 Nesting Traffic Within Seawall 


Nesting traffic of different types within Seawall’s 
congestion-controlled tunnels leads to some special cases. 
If a sender always sends less than the rate allowed by 
Seawall, she may never see any loss, causing her allowed
rate to increase to infinity. This can happen if her flows 
are low rate (e.g., web traffic) or are limited by send or 
receive windows (flow control). Such a sender can launch 
a short overwhelming burst of traffic. Hence, Seawall 
clamps the rate allowed to a sender to a multiple of the 
largest rate she has used in the recent past. Clamping rates 
is common in many control loops, such as XCP [23], for 
similar reasons. The specific choice of clamp value does 
not matter as long as it is larger than the largest possible 
bandwidth increase during a Seawall change period. 
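
The clamp can be expressed in a few lines. In this sketch the function name and the multiple of 2 are hypothetical; the text only requires that the clamp exceed the largest possible increase during one Seawall change period.

```python
def clamp_allowed_rate(allowed, recent_rates, multiple=2.0):
    """Cap the allowed rate at a multiple of the largest rate the
    sender actually used in the recent past (hypothetical helper)."""
    if not recent_rates:
        return allowed  # no usage history yet: leave the rate as is
    return min(allowed, multiple * max(recent_rates))
```

A sender whose allowed rate has drifted up to 1000 but whose recent usage peaked at 40 would be held to 80 under these placeholder settings, which bounds the size of any sudden burst.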

UDP and TCP flows behave differently under Seawall. 
While a full burst UDP flow immediately uses all the 
rate that a Seawall tunnel allows, a set of TCP flows can 
take several RTTs to ramp up; the more flows, the faster 
the ramp-up. Slower ramp up results in lower shares on 
average. Hence, Seawall modifies the network stack to
defer congestion control to Seawall’s shim layer. All other
TCP functionality, such as flow control, loss recovery, and
in-order delivery, remains as before.

The mechanics of re-factoring are similar to Congestion 
Manager (CM) [7]. Each TCP flow queries the appropri- 
ate rate limiter in the shim (e.g., using shared memory) to 
see whether a send is allowed. Flows that have a backlog 
of packets register callbacks with the shim to be notified 
when they can next send a packet. In virtualized settings, 
the TCP stack defers congestion control to the shim by 
expanding the paravirtualized NIC interface. Even for 
tenants that bring their own OSes, the performance gain 
from refactoring the stack incentivizes adoption. Some re- 
cent advances in designing device drivers [36] reduce the 
overhead of signaling across the VM boundary. However, 
Seawall uses a simplification that requires less signaling:
using hypervisor IPCs, the shim periodically reports a 
maximum congestion window to each VM to use for all 
its flows. The max congestion window is chosen large 
enough that each VM will pass packets to the shim yet 
small enough to not overflow the queues in front of the 
rate limiters in the shim. 

We believe that deferring congestion control to the Sea- 
wall shim is necessary in the datacenter context. Enforcing 
network shares at the granularity of a flow no longer suf- 
fices (see §2). Though similar in spirit to Congestion 
Manager, Seawall refactors congestion control for differ- 
ent purposes. While CM does so to share congestion 
information among flows sharing a path, Seawall uses it to 
ensure that the network allocation policy holds regardless 
of the traffic mix. In addition, this approach allows for 
transparent changes to the datacenter transport. 


4.6 Discussion 


Here, we discuss details deferred from the preceding 
description of Seawall. 


Handling WAN traffic: Traffic entering and leaving the 
datacenter is subject to more stringent DoS scrubbing at 
pre-defined chokepoints and, because WAN bandwidth is 
a scarce resource, is carefully rate-limited, metered and 
billed. We do not expect Seawall to be used for such traffic. 
However, if required, edge elements in the datacenter, 
such as load balancers or gateways, can funnel all incom- 
ing traffic into Seawall tunnels; the traffic then traverses 
a shim within the edge element. Traffic leaving the data 
center is handled analogously. 


Mapping paths to links: To run Seawall, each sender 
requires path-to-link mapping for the paths that it is send- 
ing traffic on (line 10, Class 2). A sender can acquire this 
information independently, for example via a few tracer- 
outes. In practice, however, this is much easier. Data 
center networks are automatically managed by software 
that monitors and pushes images, software and configura- 
tion to every node [19, 28]. Topology changes (e.g., due 
to failures and reconfiguration) are rare and can be dis-
seminated automatically by these systems. Many pieces
of today’s datacenter ecosystem use topology informa- 
tion (e.g., Map-Reduce schedulers [27] and VM place- 
ment algorithms). Note that Seawall does work with a 
partial mapping (e.g., a high level mapping of each server 
to its rack, container, VLAN and aggregation switch) and 
does not need to identify bottleneck links. However, path- 
to-link mapping is a key enabler; it lets Seawall run over 
any datacenter network topology. 


Choosing network weights: Seawall provides several 
ways to define the sending entity and the corresponding 
network weight. The precise choices depend on the dat- 
acenter type and application. When VMs are spun up in 
a cloud datacenter, the fabric sets the network weight of 
that VM alongside weights for CPU and memory. The 
fabric can change the VM’s weight, if necessary, and Sea-
wall re-converges rapidly. However, a VM cannot change
its own weight. The administrator of a cloud datacenter 
can assign equal weights to all VMs, thereby avoiding 
performance interference, or assign weights in proportion 
to the size or price of the VM. 

In contrast, the administrator of a platform datacenter 
can empower trusted applications to adjust their weights 
at run-time (e.g., via setsockopt()). Here, Seawall can 
also be used to specify weights per executable (e.g., back- 
ground block replicator) or per process or per port ranges. 
The choice of weights could be based on information that 
the cluster schedulers have. For example, a map-reduce 
scheduler can assign the weight of each sender feeding 
a task in inverse proportion to the aggregation fan-in of 
that task, which it knows beforehand. This ensures that
each task obtains the same bandwidth (§2.2). Similarly, 
the scheduler can boost the weight of outlier tasks that 
are starved or are blocking many other tasks [6], thereby 
improving job completion times. 
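
The fan-in rule can be checked with a small calculation. The sketch below uses hypothetical names and exact fractions: each sender feeding a task gets weight 1/fan-in, so the weights of all senders feeding any one task sum to the same value.

```python
from fractions import Fraction


def sender_weights(fan_in):
    """fan_in: task -> number of senders feeding it.
    Each sender's weight is inversely proportional to the fan-in."""
    return {task: Fraction(1, n) for task, n in fan_in.items()}


def task_total_weight(fan_in):
    """Sum of the weights of all senders feeding each task."""
    weights = sender_weights(fan_in)
    return {task: weights[task] * n for task, n in fan_in.items()}
```

A task fed by two senders and a task fed by eight senders both end up with a total weight of 1, which is what lets each task obtain the same bandwidth.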


Enforcing global allocations: Seawall has so far focused 
on enforcing the network share of a local entity (VM, task 
etc.). This is complementary to prior work on Distributed 
Rate Limiters (DRL) [31] that controls the aggregate rate 
achieved by a collection of entities. Controlling just the 
aggregate rate is vulnerable to DoS: a tenant might focus 
the traffic of all of its VMs on a shared service (such 
as storage) or link (e.g., ToR containing victim tenant’s 
servers), thereby interfering with the performance of other 
tenants while remaining under its global bandwidth cap. 
Combining Seawall with a global allocator such as DRL 
is simple. The Seawall shim reports each entity’s usage to 
the global controller in DRL, which employs its global 
policy on the collection of entities and determines what 
each entity is allowed to send. The shim then caps the rate 
allowed to that entity to the minimum of the rate allowed 
by Seawall and the rate allowed by DRL’s global policy. 
Further, the combination lets DRL scale better, since with
Seawall, DRL need only track per-entity usage and not
the per-flow state that it would otherwise have to.

Figure 8: The Seawall prototype is split into an in-kernel
NDIS filter shim (shaded gray), which implements the rate
limiting datapath, and a userspace rate adapter, which im-
plements the control loop. Configuration shown is for in-
frastructure data centers.


Choosing parameters: Whenever we adapt past work,
we follow their guidance for parameters. Of the parame-
ters unique to Seawall, their specific values have the fol-
lowing impact. We defer a formal analysis to future work.
Reducing the feedback period T makes Seawall’s adapta-
tion logic more responsive at the cost of more overhead.
We recommend choosing T ∈ [10, 50] ms. The multi-
plicative factor α controls the decrease rate. With the
CUBIC/DCTCP control loop (see Class 3), Seawall is
less sensitive to α than with the AIMD control loop, since
the former ramps back up more aggressively. In Class 2, β
controls how much link rate is apportioned evenly versus
based on current usage. With a larger β, the control loop
reacts more quickly to changing demands but delays ap-
portioning unused rate to destinations that need it. We
recommend β > 0.8.


5. SEAWALL PROTOTYPE


The shim layer of our prototype is built as an NDIS 
packet filter (Figure 8). It interposes new code between 
the TCP/IP stack and the NIC driver. In virtualized set- 
tings, the shim augments the vswitch in the root partition. 
Our prototype is compatible with deployments that use 
the Windows 7 kernel as the server OS or as the root par- 
tition of Hyper-V. The shim can be adapted to other OSes 
and virtualization technologies, e.g., to support Linux and 
Xen, one can reimplement it as a Linux network queuing 
discipline module. For ease of experimentation, the logic 
to adapt rates is built in user space whereas the filters on 
the send side and the packet processing on the receive 
side are implemented in kernel space. 


Clocking rate limiters: The prototype uses software- 
based token bucket filters to limit the rate of each tunnel. 
Implementing software rate limiters that work correctly
and efficiently at high rates (e.g., 100s of Mbps) requires
high precision interrupts, which are not widely available
to drivers. Instead, we built a simple high precision clock.
One core, per rack of servers, stays in a busy loop, and 
broadcasts a UDP heartbeat packet with the current time 
to all the servers within that rack once every 0.1ms; the 
shim layers use these packets to clock their rate limiters. 
We built a roughly equivalent window-based version of 
the Seawall shim as proof-of-concept. Windowing is easier 
to engineer, since it is self-clocking and does not require 
high precision timers, but incurs the expense of more 
frequent feedback packets (e.g., once every 10 packets). 
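
A token bucket clocked by such heartbeats could be sketched as below. This is an illustration only: the class and method names are hypothetical, timestamps are integer microseconds carried by the heartbeat, and the bucket refills solely when a heartbeat arrives rather than from a local timer.

```python
class TokenBucket:
    """Rate limiter whose notion of time comes from external
    heartbeat packets (hypothetical sketch)."""

    def __init__(self, rate_bps, burst_bytes):
        self.bytes_per_us = rate_bps / 8.0 / 1_000_000  # bytes per microsecond
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)
        self.last_tick_us = None

    def on_heartbeat(self, now_us):
        # Refill according to the time elapsed between heartbeats.
        if self.last_tick_us is not None:
            elapsed = now_us - self.last_tick_us
            self.tokens = min(self.burst,
                              self.tokens + elapsed * self.bytes_per_us)
        self.last_tick_us = now_us

    def try_send(self, nbytes):
        # Admit the packet only if enough tokens have accumulated.
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

At 8 Mbps with heartbeats every 100 microseconds (the 0.1 ms interval above), each heartbeat adds 100 bytes of credit, so a 1500-byte packet becomes sendable again 15 heartbeats after the bucket is drained.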


Bit-stealing and stateless offload compatibility: A 
practical concern is the need to be compatible with NIC 
offloads. In particular, adding an extra packet header to 
support Seawall prevents the use of widely-used NIC of- 
floads, such as large send offload (LSO) and receive side 
coalescing (RSC) which only work for known packet for- 
mats such as UDP or TCP. This leads to increased CPU 
overhead and decreased throughput. On a quad core 2.66 GHz
Intel Core2 Duo with an Intel 82567LM NIC, sending at
the line rate of 1Gbps requires 20% more CPU without
LSO (net: 30% without vs 10% with LSO) [37].

NIC vendors have plans to improve offload support for 
generic headers. To be immediately deployable without 
performance degradation, Seawall steals bits from existing 
packet headers, that is, it encodes information in parts 
of the packet that are unused or predictable and hence 
can be restored by the shim at the receiver. For both 
UDP and TCP, Seawall uses up to 16 bits from the IP ID 
field, reserving the lower order bits for the segmentation 
hardware if needed. For TCP packets, Seawall repurposes 
the timestamp option: it compresses the option Kind and 
Length fields from 16 bits down to 1 bit, leaving the rest
for Seawall data. In virtualized environments, guest OSes 
are para-virtualized to always include timestamp options. 
The feedback is sent out-of-band in separate packets. We 
also found bit-stealing easier to engineer than adding 
extra headers, which could easily lead to performance 
degradation unless buffers were managed carefully. 
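
The IP ID part of this encoding amounts to simple bit packing, sketched below. The function names and the choice of 4 reserved low-order bits are hypothetical; the text states only that up to 16 bits are used and that the lower-order bits are left to the segmentation hardware when needed. What Seawall stores in the stolen bits (for example, part of the tunnel sequence number) is likewise an assumption here.

```python
def encode_ip_id(seawall_bits, low_bits_value=0, reserved_low_bits=4):
    """Pack Seawall data into the upper bits of the 16-bit IP ID,
    leaving the low-order bits for segmentation hardware."""
    width = 16 - reserved_low_bits
    if not 0 <= seawall_bits < (1 << width):
        raise ValueError("Seawall data does not fit in the stolen bits")
    if not 0 <= low_bits_value < (1 << reserved_low_bits):
        raise ValueError("low-order value does not fit in reserved bits")
    return (seawall_bits << reserved_low_bits) | low_bits_value


def decode_ip_id(ip_id, reserved_low_bits=4):
    """Recover (seawall_bits, low_bits_value) from a 16-bit IP ID."""
    mask = (1 << reserved_low_bits) - 1
    return ip_id >> reserved_low_bits, ip_id & mask
```

With 4 bits reserved, 12 bits remain for Seawall; the receiver's shim recovers them with a shift and restores whatever the low bits carried.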


Offloading rate limiters and direct I/O: A few emerg- 
ing standards to improve network I/O performance, such 
as Direct I/O and SR-IOV, let guest VMs bypass the vir- 
tual switch and exchange packets directly with the NIC. 
But this also bypasses the Seawall shim. Below, we pro-
pose a few ways to restore compatibility. However, we
note that the loss of the security and manageability fea- 
tures provided by the software virtual switch has limited 
the deployment of direct I/O NICs in public clouds. To 
encourage deployment, vendors of such NICs plan to 
support new features specific to datacenters. 

By offloading token bucket- and window-based lim- 
iters from the virtual switch to NIC or switch hardware, 
tenant traffic can be controlled even if guest VMs di- 
rectly send packets to the hardware. To support Seawall, 
such offloaded rate limiters need to provide the same
granularity of flow classification (entity to entity tunnels)
as the shim and report usage and congestion statistics. 
High end NICs that support stateful TCP, iSCSI, and 
RDMA offloads already support tens of thousands to mil- 
lions of window-control engines in hardware. Since most 
such NICs are programmable, they can likely support the 
changes needed to return statistics to Seawall. Switch po- 
licers have similar scale and expressiveness properties. In 
addition, low cost programmable switches can be used to 
monitor the network for violations [38]. Given the diver- 
sity of implementation options, we believe that the design 
point occupied by Seawall, i.e., using rate- or window- 
controllers at the network edge, is feasible now and as 
data rates scale up. 


6. EVALUATION 


We ran a series of experiments using our prototype to 
show that Seawall achieves line rate with minimal CPU 
overhead, scales to typical data centers, converges to net-
work allocations that are agnostic to communications pat-
tern (i.e., number of flows and destinations) and protocol
mix (i.e., UDP and TCP), and provides performance isola- 
tion. Through experiments with web workloads, we also 
demonstrate how Seawall can protect cloud-hosted ser- 
vices against DoS attacks, even those using UDP floods. 

All experiments used the token bucket filter-based shim
(i.e., rate limiter), which is our best-performing prototype
and matches commonly-available hardware rate limiters.
The following hold unless otherwise stated: (1) Seawall
was configured with the default parameters specified in §4,
(2) all results were aggregated from 10 two minute runs,
with each datapoint a 15 second average and error bars
indicating the 95% confidence interval.

Testbed: For our experiments, we used a 60 server cluster
spread over three racks with 20 servers per rack. The 
physical machines were equipped with Xeon L5520 2.27 
GHz CPUs (quad core, two hyperthreads per core), Intel 
82576 NICs, and 4GB of RAM. The NIC access links 
were 1Gb/s and the links from the ToR switches up to 
the aggregation switch were 10Gb/s. There was no over- 
subscription within each rack. The ToR uplinks were 1:4 
over-subscribed. We chose this topology because it is 
representative of typical data centers. 

For virtualization, we use Windows Server 2008R2 
Hyper-V with Server 2008R2 VMs. This version of 
Hyper-V exploits the Nehalem virtualization optimiza- 
tions, but does not use the direct I/O functionality on the 
NICs. Each guest VM was provisioned with 1.5 GB of 
RAM and 4 virtual CPUs. 


6.1 Microbenchmarks 


6.1.1 Throughput and overhead 


To evaluate the performance and overhead of Seawall,
we measured the throughput and CPU overhead of tunnel-
ing a TCP connection between two machines through the
shim. To minimize extraneous sources of noise, no other
traffic was present in the testbed during each experiment
and the sender and receiver transferred data from and to
memory.

            Throughput    CPU @ Sender    CPU @ Receiver
            (Mb/s)        (%)             (%)
Seawall     947 ± 9       20.7 ± 0.6      14.2 ± 0.4
NDIS        977 ± 4       18.7 ± 0.4      13.5 ± 1.1
Baseline    979 ± 6       16.9 ± 1.9      10.8 ± 0.8

Table 1: CPU overhead comparison of Seawall, a null
NDIS driver, and an unmodified network stack. Seawall
achieved line rate with low overhead.

Seawall achieved nearly line rate at steady state, with 
negligible increase in CPU utilization, adding 3.8% at the 
sender and 3.4% at the receiver (Table 1). Much of this 
overhead came from installing an NDIS
filter driver: the null NDIS filter by itself added 1.8% and 
2.7% overhead, respectively. The NDIS framework is 
fairly light weight since it runs in the kernel and requires 
no protection domain transfers. 

Subtracting out the contributions from the NDIS filter 
driver reveals the overheads due to Seawall: it incurred 
slightly more overhead on the sender than the receiver. 
This is expected since the sender does more work: on 
receiving packets, a Seawall receiver need only buffer 
congestion information and bounce it back to the sender, 
while the sender incurs the overhead of rate limiting and 
may have to merge congestion information from many 
destinations. 

Seawall easily scales to today’s data centers. The shim at 
each node maintains a rate limiter, with a few KBs of state 
each, for every pair of communicating entities terminating 
at that node. The per-packet cost on the data path is fixed 
regardless of data center size. A naive implementation of 
the rate controller incurs O(DL) complexity per sending 
entity (VM or task) where D is the number of destinations 
the VM communicates with and L is the number of links 
on paths to those destinations. In typical data center 
topologies, the diameter is small, and serves as an upper 
bound for L. All network stacks on a given node have 
collective state and processing overheads that grow at 
least linearly with D; these dominate the corresponding 
contributions from the rate controller and shim. 


6.1.2  Traffic-agnostic network allocation 


Seawall seeks to control the network share obtained by a 
sender, regardless of traffic. In particular, a sender should 
not be able to attain bandwidth beyond that allowed by 
the configured weight, no matter how it varies protocol 
type, number of flows, and number of destinations. 

To evaluate the effectiveness of Seawall in achieving this 
goal, we set up the following experiment. Two physical 
nodes, hosting one VM each, served as the sources, with 
one VM dedicated to selfish traffic and the other to well-
behaved traffic. One physical node served as the sink for
all traffic; it was configured with two VMs, with one VM
serving as the sink for well-behaved traffic and the other
serving as the sink for selfish traffic.

Figure 9: Seawall ensures that despite using full burst
UDP flows or many TCP flows, the share of a selfish user is
held proportional to its weight. (In (b), the bars show total
throughput, with the fraction below the divider correspond-
ing to selfish traffic and the fraction above corresponding to
well-behaved traffic.)

Both well-behaved and selfish traffic used the same 
number of source VMs, with all Seawall senders assigned 
the same network weight. The well-behaved traffic con- 
sisted of a single long-lived TCP flow from each source, 
while the selfish traffic used one of three strategies to 
achieve a higher bandwidth share: using full burst UDP
flows, using large numbers of TCP flows, and using many
destinations.

Selfish traffic = Full-burst UDP: Figure 9(a) shows the
aggregate bandwidth achieved by the well-behaved traf- 
fic (long-lived TCP) when the selfish traffic consisted 
of full rate UDP flows. The sinks for well-behaved and 
selfish traffic were colocated on a node with a single 
1Gbps NIC. Because each sender had equal weight, Sea- 
wall assigned half of this capacity to each sender. Without 
Seawall, selfish traffic overwhelms well-behaved traffic, 
leading to negligible throughput for well-behaved traffic. 
By bundling the UDP traffic inside a tunnel that imposed 
congestion control, Seawall ensured that well-behaved traf- 
fic retained reasonable performance. 

Selfish traffic = Many TCP flows: Figure 9(b) shows the 
bandwidth shares achieved by selfish and well-behaved 
traffic when selfish senders used many TCP flows. As 
before, well-behaved traffic ideally should have achieved 
1/2 of the bandwidth. When selfish senders used the same 
number of flows as well-behaved traffic, bandwidth was 
divided evenly (left pair of bars). In runs without Seawall, 
selfish senders that used twice as many flows obtained 
2/3rds of the bandwidth because TCP congestion control di- 
Figure 10: By combining feedback from multiple desti- 
nations, Seawall ensures that the share of a sender remains 
independent of the number of destinations it communicates 
with. (The fraction of the bar below the divider corresponds 
to the fraction of bottleneck throughput achieved by selfish 
traffic.) 


vided bandwidth evenly across flows (middle pair of bars). 
Runs with Seawall resulted in approximately even band- 
width allocation. Note that Seawall achieved slightly lower 
throughput in aggregate. This was due to slower recovery 
after loss: the normal traffic had one sawtooth per TCP 
flow whereas Seawall had one per source VM; we believe 
this can be improved using techniques from §4. Without 
Seawall, selfish traffic that used 66 times more flows achieved 
a dominant share of bandwidth; the well-behaved traf- 
fic was allocated almost no bandwidth (rightmost pair of 
bars). With Seawall, despite the wide disparity in number of 
flows, bandwidth was divided approximately evenly. 
Again, Seawall improved the throughput of well-behaved 
traffic (the portion above the divider) by several orders of 
magnitude. 
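
The shares reported above follow from simple arithmetic, sketched below. The sketch assumes an idealized model in which TCP divides a bottleneck evenly per flow and Seawall divides it per sender in proportion to weight; the function names are illustrative and not part of Seawall.

```python
from fractions import Fraction


def per_flow_share(selfish_flows: int, good_flows: int) -> Fraction:
    """Selfish share of a bottleneck when bandwidth is split evenly per flow."""
    return Fraction(selfish_flows, selfish_flows + good_flows)


def per_sender_share(selfish_weight: int, good_weight: int) -> Fraction:
    """Selfish share when bandwidth is split per sender by weight (idealized)."""
    return Fraction(selfish_weight, selfish_weight + good_weight)


# Flow-count ratios used in the experiment: equal, 2x, and 66x.
for ratio in (1, 2, 66):
    print(ratio, per_flow_share(ratio, 1), per_sender_share(1, 1))
```

Under this model the per-flow share grows toward 1 as the selfish sender opens more flows, while the per-sender share stays at 1/2 for equal weights.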

Selfish traffic = Arbitrarily many destinations: This 
experiment evaluated Seawall’s effectiveness against self- 
ish tenants that opened connections to many destinations. 
The experiment used a topology similar to that in Figure 7. 
A well-behaved sender VM and a selfish sender VM were 
located on the same server. Each sink was a VM and ran 
on a separate, dedicated machine. The well-behaved traf- 
fic was assigned one sink machine and the selfish traffic 
was assigned a variable number of sink machines. Both 
well-behaved and selfish traffic consisted of one TCP flow 
per sink. As before, the sending VMs were configured 
with the same weight, so that well-behaved traffic would 
achieve an even share of the bottleneck. 

Figure 10 plots the fraction of bottleneck bandwidth 
achieved by well-behaved traffic with and without Seawall. 
We see that without Seawall, the share of the selfish traffic 
was proportional to the number of destinations. With 
Seawall, the share of the well-behaved traffic remained 
constant at approximately half, independent of the number 
Figure 11: Despite bandwidth pressure, Seawall ensures 
that the average HTTP request latency remains small with- 
out losing throughput. 


of destinations. 


6.2 Performance isolation for web servers 


To show that Seawall protects against performance in- 
terference similar to that shown in §2, we evaluated the 
achieved level of protection against a DoS attack on a 
web server. Since cloud datacenters are often used to host 
web-accessible services, this is a common use case. 

In this experiment, an attacker targeted the HTTP re- 
sponses sent from the web server to its clients. To launch 
such attacks, an adversary places a source VM and a 
sink VM such that traffic between these VMs crosses the 
same bottleneck links as the web server. The source VM 
is close to the server, say on the same rack or machine, 
while the sink VM is typically on another rack. Depend- 
ing on where the sink is placed, the attack can target the 
ToR uplink or another link several hops away. 

All machines were colocated on the same rack. The 
web server VM, running Microsoft IIS 7, and attacker 
source VM, generating UDP floods, resided in separate, 
dedicated physical machines. A single web client VM 
requested data from the server and shared a physical ma- 
chine with an attacker sink VM. The web clients used 
WcAsync to generate well-formed web sessions. Session 
arrivals followed a Poisson process and were exponen- 
tially sized with a mean of 10 requests. Requests followed 
a WebStone distribution, varying in size from 500B re- 
sponses to 5MB responses with smaller files being much 
more popular. 

As expected, a full-rate UDP attack flood caused con- 
gestion on the access link of the web client, reducing 
throughput to close to zero and substantially increasing 
latency. With Seawall, the web server behaved as if there 
were no attack. To explore data points where the access 
link was not overwhelmed, we dialed down the UDP at- 
tack rate to 700Mbps, enough to congest the link but not 
to stomp out the web server’s traffic. While achieving 
roughly the same throughput as in the case of no protec- 
tion, Seawall improved the latency observed by web traffic 
by almost 50% (Figure 11). This is because sending the 
attack traffic through a congestion controlled tunnel en- 
sured that the average queue size at the bottleneck stays 
small, thereby reducing queuing delays. 


7. DISCUSSION 


Here, we discuss how Seawall can be used to imple- 
ment rich cloud service models that provide bandwidth 
guarantees to tenants, the implications of our architectural 
decisions given trends in data centers and hardware, and 
the benefits of jointly modifying senders and receivers to 
achieve new functionality in data center networks. 


7.1 Sharing policies 


Virtual Data Centers (VDCs) have been proposed [20, 
17, 40] as a way to specify tenant networking require- 
ments in cloud data centers. VDCs seek to approximate, 
in terms of security isolation and performance, a dedi- 
cated data center for each tenant and allow tenants to 
specify SLA constraints on network bandwidth at per-port 
and per-source/dest-pair granularities. When allocating 
tenant VMs to physical hardware, the data center fabric 
simultaneously satisfies the specified constraints while 
optimizing node and network utilization. 

Though Seawall policies could be seen as a simpler- 
to-specify alternative to VDCs that closely matches the 
provisioning knobs (e.g., disk, CPU, and memory size) of 
current infrastructure clouds, Seawall’s weight-based poli- 
cies can enhance VDCs in several ways. Some customers, 
through analysis or operational experience, understand 
the traffic requirements of their VMs; VDCs are attrac- 
tive since they can exploit such detailed knowledge to 
achieve predictable performance. To improve VDCs with 
Seawall, the fabric uses weights to implement the hard 
bandwidth guarantees specified in the SLA: with appro- 
priate weights, statically chosen during node- and path- 
placement, Seawall will converge to the desired allocation. 
Unlike implementations based on static reservations [17], 
the Seawall implementation is work-conserving, max-min 
fair, and achieves higher utilization through statistical 
multiplexing. 
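
One way to read the weight-based guarantee is sketched below. The sketch assumes a single bottleneck link, backlogged senders, and a hypothetical policy in which each tenant's weight equals its guaranteed rate; none of these names come from Seawall.

```python
def weighted_shares(guarantees_mbps: dict, capacity_mbps: float) -> dict:
    """Split link capacity in proportion to weights set equal to the guarantees.

    If the guarantees sum to at most the capacity, every tenant receives at
    least its guarantee, and spare capacity is shared in the same proportion.
    """
    total_weight = sum(guarantees_mbps.values())
    return {tenant: capacity_mbps * weight / total_weight
            for tenant, weight in guarantees_mbps.items()}


# Hypothetical SLA: 200 Mbps and 300 Mbps guarantees on a 1000 Mbps link.
shares = weighted_shares({"A": 200, "B": 300}, 1000)
print(shares)
```

Because the split is proportional rather than a static reservation, the 500 Mbps that neither guarantee claims is still used, which is the work-conserving behavior described above.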

Seawall also improves a tenant’s control of its own VDC. 
Since Seawall readily accepts dynamic weight changes, 
each tenant can adjust its allocation policy at a fine gran- 
ularity in response to changing application needs. The 
fabric permits tenants to reallocate weights between differ- 
ent tunnels so long as the resulting weight does not exceed 
the SLA; this prevents tenants from stealing service and 
avoids having to rerun the VM placement optimizer. 


7.2 System architecture 


Topology assumptions: The type of topology and avail- 
able bandwidth affects the complexity requirements of 
network sharing systems. In full bisection bandwidth 
topologies, congestion can only occur at the edge. System 
design is simplified [44, 40, 30], since fair shares can be 
computed solely from information about edge congestion, 
without any topology information or congestion feedback 
from the core. 

Seawall supports general topologies, allowing it to pro- 
vide benefits even in legacy or cost-constrained data cen- 
ter networks. Such topologies are typically bandwidth- 
constrained in the core; all nodes using a given core link 
need to be accounted for to achieve fair sharing, band- 
width reservations, and congestion control. Seawall ex- 
plicitly uses topology information in its control layer to 
prevent link over-utilization. 

Rate limiters and control loops: Using more rate lim- 
iters enables a network allocation system to support richer, 
more granular policies. Not having enough rate limiters 
can result in aliasing. For instance, VM misbehavior can 
cause Gatekeeper [40] to penalize unrelated VMs sending 
to the same destination. Using more complex rate lim- 
iters can improve system performance. For instance, rate 
limiters based on multi-queue schedulers such as DWRR 
or Linux’s hierarchical queuing classes can utilize the 
network more efficiently when rate limiter parameters 
and demand do not match, and the self-clocking nature 
of window-based limiters can reduce switch buffering re- 
quirements as compared to rate-based limiters. However, 
having a large number of complex limiters can constrain 
how a network sharing architecture can be realized, since 
NICs and switches do not currently support such rate 
limiters at scale. 
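
As a point of reference for the rate-based limiters mentioned above, the following is a minimal token-bucket sketch. It is a generic textbook design chosen for illustration; the text does not state which limiter design Seawall uses, and the class and parameter names are assumptions.

```python
class TokenBucket:
    """Rate-based limiter: admits `rate` bytes/second with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst      # start with a full bucket
        self.last = now

    def allow(self, size: float, now: float) -> bool:
        # Refill for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


bucket = TokenBucket(rate=1000, burst=1500)
results = [bucket.allow(1500, 0.0), bucket.allow(1500, 0.5), bucket.allow(1500, 1.5)]
print(results)
```

A window-based limiter would instead cap the number of unacknowledged bytes in flight, which is what gives it the self-clocking property noted in the text.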

To maximize performance and policy expressiveness, 
a network allocation system should support a large num- 
ber of limiters of varying capability. The current Seawall 
architecture can support rate- and window-based limiters 
based in hardware and software. As future work, we are 
investigating ways to map topology information onto hi- 
erarchical limiters; to compile policies given a limited 
number of available hardware limiters; and to trade off 
rate limiter complexity against controller complexity, us- 
ing longer adaptation intervals when more capable rate 
limiters are available. 


7.3 Partitioning sender/receiver functionality 


Control loops can benefit from receiver-side informa- 
tion and coordination, since the receiver is aware of the 
current traffic demand from all sources and can send feed- 
back to each with lower overhead. Seawall currently uses 
a receiver-driven approach customized for map-reduce to 
achieve better network scheduling; as future work we are 
building a general solution at the shim layer. 

In principle, a purely receiver-directed approach to im- 
plementing a new network allocation policy, such as that 
used in [44, 40], might reduce system complexity since 
the sender TCP stack does not need to be modified. How- 
ever, virtualization stack complexity does not decrease 
substantially, since the rate controller simply moves from 
the sender to the receiver. Moreover, limiting changes to 
one endpoint in data centers provides little of the adoption 
cost advantages found in the heterogeneous Internet envi- 
ronment. Modifying the VMs to defer congestion control 
to other layers can help researchers and practitioners to 
identify and deploy new network sharing policies and 
transport protocols for the data center. 

A receiver-only approach can also add complexity. 
While some allocation policies are easy to attain by 
treating the sender as a black box, others are not. For 
instance, eliminating fate sharing from Gatekeeper and 
adding weighted, fair work-conserving scheduling ap- 
pears non-trivial. Moreover, protecting a receiver-only 
approach from attack requires adding a detector for non- 
conformant senders. While such detectors have been stud- 
ied for WAN traffic [13], it is unclear whether they are 
feasible in the data center. Such detectors might also per- 
mit harmful traffic that running new, trusted sender-side 
code can trivially exclude. 


8. RELATED WORK 


Proportional allocation of shared resources has been 
a recurring theme in the architecture and virtualization 
communities [42, 15]. To the best of our knowledge, 
Seawall is the first to extend this to the data center network 
and support generic sending entities (VMs, applications, 
tasks, processes, etc.). 

Multicast congestion control [14], while similar at first 
blush, targets a very different problem since it has to 
allow for any participant to send traffic to the group while 
ensuring TCP-friendliness. It is unclear how to adapt 
these schemes to proportionally divide the network. 

Recent work on hypervisors, network stacks, and soft- 
ware routers has shown that software-based network 
processing, like that used in Seawall for monitoring and 
rate limiting, can be more flexible than hardware-based 
approaches yet achieve high performance. [35] presents 
an optimized virtualization stack that achieves compara- 
ble performance to direct I/O. The Sun Crossbow network 
stack provides an arbitrary number of bandwidth-limited 
virtual NICs [41]. Crossbow provides identical semantics 
regardless of underlying physical NIC and transparently 
leverages offloads to improve performance. Seawall’s us- 
age of rate limiters can benefit from these ideas. 

QCN is an emerging Ethernet standard for congestion 
control in datacenter networks [29]. In QCN, upon de- 
tecting a congested link, the switch sends feedback to the 
heavy senders. The feedback packet uniquely identifies 
the flow and congestion location, enabling senders that 
receive feedback to rate limit specific flows. QCN uses 
explicit feedback to drive a more aggressive control loop 
than TCP. While QCN can throttle the heavy senders, it 
is not designed to provide fairness guarantees, tunable 
or otherwise. Further, QCN requires changes to switch 
hardware and can only cover purely Layer 2 topologies. 

Much work has gone into fair queuing mechanisms in 
switches [12]. Link local sharing mechanisms, such as 
Weighted Fair Queuing and Deficit Round Robin, sepa- 
rate traffic into multiple queues at each switch port and 
arbitrate service between the queues in some priority or 
proportion. NetShare [24] builds on top of WFQ support 
in switches. This approach is useful to share the network 
between a small number of large sending entities (e.g., 
a whole service type, such as “Search” or “Distributed 
storage” in a platform data center). The number of queues 
available in today’s switches, however, is several orders 
of magnitude smaller than the numbers of VMs and tasks 
in today’s datacenters. More fundamentally, since link 
local mechanisms lack end-to-end information they can 
let significant traffic through only to be dropped at some 
later bottleneck on the path. Seawall can achieve better 
scalability by mapping many VMs onto a small, fixed 
number of queues and achieves better efficiency by using 
end-to-end congestion control. 


9. FINAL REMARKS 


Economies of scale are pushing distributed applica- 
tions to co-exist with each other on shared infrastructure. 
The lack of mechanisms to apportion network bandwidth 
across these entities leads to a host of problems, from re- 
duced security to unpredictable performance and to poor 
ability to improve high level objectives such as job com- 
pletion time. Seawall is a first step towards providing data 
center administrators with tools to divide their network 
across the sharing entities without requiring any coopera- 
tion from the entities. It is novel in its ability to scale to 
massive numbers of sharing entities and uniquely adapts 
ideas from congestion control to the problem of enforcing 
network share agnostic to traffic type. The design space 
that Seawall occupies, pushing functionality to software at 
the network edge, appears well-suited to emerging trends 
in data center and virtualization hardware. 

Notes 


1. Perhaps because it is hard to predict such events and find 
appropriate tasks at short notice. Also, running more tasks 
requires spare memory and has initialization overhead. 

Abstract 


We consider the problem of fair resource allocation 
in a system containing different resource types, where 
each user may have different demands for each resource. 
To address this problem, we propose Dominant Resource 
Fairness (DRF), a generalization of max-min fairness 
to multiple resource types. We show that DRF, unlike 
other possible policies, satisfies several highly desirable 
properties. First, DRF incentivizes users to share re- 
sources, by ensuring that no user is better off if resources 
are equally partitioned among them. Second, DRF is 
strategy-proof, as a user cannot increase her allocation 
by lying about her requirements. Third, DRF is envy- 
free, as no user would want to trade her allocation with 
that of another user. Finally, DRF allocations are Pareto 
efficient, as it is not possible to improve the allocation of 
a user without decreasing the allocation of another user. 
We have implemented DRF in the Mesos cluster resource 
manager, and show that it leads to better throughput and 
fairness than the slot-based fair sharing schemes in cur- 
rent cluster schedulers. 


1 Introduction 


Resource allocation is a key building block of any shared 
computer system. One of the most popular allocation 
policies proposed so far has been max-min fairness, 
which maximizes the minimum allocation received by a 
user in the system. Assuming each user has enough de- 
mand, this policy gives each user an equal share of the 
resources. Max-min fairness has been generalized to in- 
clude the concept of weight, where each user receives a 
share of the resources proportional to its weight. 
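
Weighted max-min fairness for a single resource can be computed by progressive filling. The sketch below is one common formulation, given here as an illustration under the assumption that demands and weights are known up front; the function and argument names are hypothetical.

```python
def weighted_max_min(capacity: float, demands: dict, weights: dict) -> dict:
    """Weighted max-min fair allocation of one resource via progressive filling."""
    alloc = {user: 0.0 for user in demands}
    active = {user for user in demands if demands[user] > 0}
    remaining = float(capacity)
    while active and remaining > 0:
        total_weight = sum(weights[u] for u in active)
        # Users whose whole demand fits inside their weighted share are satisfied.
        satisfied = {u for u in active
                     if demands[u] <= remaining * weights[u] / total_weight}
        if not satisfied:
            # Everyone left is bottlenecked: split what remains by weight.
            for u in active:
                alloc[u] = remaining * weights[u] / total_weight
            break
        for u in satisfied:
            alloc[u] = demands[u]
            remaining -= demands[u]
        active -= satisfied
    return alloc


equal = weighted_max_min(10, {"A": 2, "B": 8, "C": 8}, {"A": 1, "B": 1, "C": 1})
weighted = weighted_max_min(12, {"A": 100, "B": 100, "C": 100}, {"A": 1, "B": 2, "C": 3})
print(equal, weighted)
```

In the first call user A needs less than an equal share, so its leftover is split between B and C; in the second call all users are backlogged and the split follows the weights.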

The attractiveness of weighted max-min fairness 
stems from its generality and its ability to provide perfor- 
mance isolation. The weighted max-min fairness model 
can support a variety of other resource allocation poli- 
cies, including priority, reservation, and deadline based 
allocation [31]. In addition, weighted max-min fairness 
ensures isolation, in that a user is guaranteed to receive 


her share irrespective of the demand of the other users. 

Given these features, it should come as no surprise 
that a large number of algorithms have been proposed 
to implement (weighted) max-min fairness with various 
degrees of accuracy, such as round-robin, proportional 
resource sharing [32], and weighted fair queueing [12]. 
These algorithms have been applied to a variety of re- 
sources, including link bandwidth [8, 12, 15, 24, 27, 29], 
CPU [11, 28, 31], memory [4, 31], and storage [5]. 

Despite the vast amount of work on fair allocation, the 
focus has so far been primarily on a single resource type. 
Even in multi-resource environments, where users have 
heterogeneous resource demands, allocation is typically 
done using a single resource abstraction. For example, 
fair schedulers for Hadoop and Dryad [1, 18, 34], two 
widely used cluster computing frameworks, allocate re- 
sources at the level of fixed-size partitions of the nodes, 
called slots. This is despite the fact that different jobs 
in these clusters can have widely different demands for 
CPU, memory, and I/O resources. 

In this paper, we address the problem of fair alloca- 
tion of multiple types of resources to users with heteroge- 
neous demands. In particular, we propose Dominant Re- 
source Fairness (DRF), a generalization of max-min fair- 
ness for multiple resources. The intuition behind DRF is 
that in a multi-resource environment, the allocation of a 
user should be determined by the user’s dominant share, 
which is the maximum share that the user has been allo- 
cated of any resource. In a nutshell, DRF seeks to max- 
imize the minimum dominant share across all users. For 
example, if user A runs CPU-heavy tasks and user B runs 
memory-heavy tasks, DRF attempts to equalize user A’s 
share of CPUs with user B’s share of memory. In the 
single resource case, DRF reduces to max-min fairness 
for that resource. 

The strength of DRF lies in the properties it satis- 
fies. These properties are trivially satisfied by max-min 
fairness for a single resource, but are non-trivial in the 
case of multiple resources. Four such properties are 
sharing incentive, strategy-proofness, Pareto efficiency, 
and envy-freeness. DRF provides incentives for users to 
share resources by guaranteeing that no user is better off 
in a system in which resources are statically and equally 
partitioned among users. Furthermore, DRF is strategy- 
proof, as a user cannot get a better allocation by lying 
about her resource demands. DRF is Pareto-efficient as 
it allocates all available resources subject to satisfying 
the other properties, and without preempting existing al- 
locations. Finally, DRF is envy-free, as no user prefers 
the allocation of another user. Other solutions violate at 
least one of the above properties. For example, the pre- 
ferred [3, 22, 33] fair division mechanism in microeco- 
nomic theory, Competitive Equilibrium from Equal In- 
comes [30], is not strategy-proof. 

We have implemented and evaluated DRF in 
Mesos [16], a resource manager over which multiple 
cluster computing frameworks, such as Hadoop and MPI, 
can run. We compare DRF with the slot-based fair shar- 
ing scheme used in Hadoop and Dryad and show that 
slot-based fair sharing can lead to poorer performance, 
unfairly punishing certain workloads, while providing 
weaker isolation guarantees. 

While this paper focuses on resource allocation in dat- 
acenters, we believe that DRF is generally applicable to 
other multi-resource environments where users have het- 
erogeneous demands, such as in multi-core machines. 

The rest of this paper is organized as follows. Sec- 
tion 2 motivates the problem of multi-resource fairness. 
Section 3 lists fairness properties that we will consider in 
this paper. Section 4 introduces DRF. Section 5 presents 
alternative notions of fairness, while Section 6 analyzes 
the properties of DRF and other policies. Section 7 pro- 
vides experimental results based on traces from a Face- 
book Hadoop cluster. We survey related work in Sec- 
tion 8 and conclude in Section 9. 


2 Motivation 


While previous work on weighted max-min fairness has 
focused on single resources, the advent of cloud com- 
puting and multi-core processors has increased the need 
for allocation policies for environments with multiple 
resources and heterogeneous user demands. By multi- 
ple resources we mean resources of different types, in- 
stead of multiple instances of the same interchangeable 
resource. 

To motivate the need for multi-resource allocation, we 
plot the resource usage profiles of tasks in a 2000-node 
Hadoop cluster at Facebook over one month (October 
2010) in Figure 1. The placement of a circle in Figure 1 
indicates the memory and CPU resources consumed by 
tasks. The size of a circle is logarithmic in the number of 
tasks in the region of the circle. Though the majority of 
tasks are CPU-heavy, there exist tasks that are memory- 


NSDI '11: 8th USENIX Symposium on Networked Systems Design and Implementation 


[Figure 1 plot: per-task CPU demand (cores) versus per-task memory demand (GB).] 


Figure 1: CPU and memory demands of tasks in a 2000-node 
Hadoop cluster at Facebook over one month (October 2010). 
Each bubble’s size is logarithmic in the number of tasks in its 
region. 


heavy as well, especially for reduce operations. 

Existing fair schedulers for clusters, such as Quincy 
[18] and the Hadoop Fair Scheduler [2, 34], ignore the 
heterogeneity of user demands, and allocate resources at 
the granularity of slots, where a slot is a fixed fraction 
of a node. This leads to inefficient allocation as a slot is 
more often than not a poor match for the task demands. 

Figure 2 quantifies the level of fairness and isola- 
tion provided by the Hadoop MapReduce fair sched- 
uler [2, 34]. The figure shows the CDFs of the ratio 
between the task CPU demand and the slot CPU share, 
and of the ratio between the task memory demand and 
the slot memory share. We compute the slot memory 
and CPU shares by simply dividing the total amount of 
memory and CPUs by the number of slots. A ratio of 
1 corresponds to a perfect match between the task de- 
mands and slot resources, a ratio below 1 corresponds to 
tasks underutilizing their slot resources, and a ratio above 
1 corresponds to tasks over-utilizing their slot resources, 
which may lead to thrashing. Figure 2 shows that most of 
the tasks either underutilize or overutilize some of their 
slot resources. Modifying the number of slots per ma- 
chine will not solve the problem as this may result either 
in a lower overall utilization or more tasks experiencing 
poor performance due to over-utilization (see Section 7). 
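
The demand-to-slot ratio described above can be computed as in the following sketch. The node size, slot count, and task demands are hypothetical values chosen for illustration and are not taken from the Facebook trace.

```python
def demand_to_slot_ratio(task_demand: float, node_total: float, slots_per_node: int) -> float:
    """Ratio of a task's demand to the per-slot share of one resource."""
    slot_share = node_total / slots_per_node  # total resource divided by slot count
    return task_demand / slot_share


# Hypothetical node: 8 cores and 16 GB split into 8 slots (1 core, 2 GB per slot).
cpu_ratio = demand_to_slot_ratio(0.5, 8, 8)   # task uses half a core
mem_ratio = demand_to_slot_ratio(4.0, 16, 8)  # task uses 4 GB
print(cpu_ratio, mem_ratio)
```

This hypothetical task underutilizes its slot's CPU (ratio below 1) while over-utilizing its memory (ratio above 1), the kind of mismatch that Figure 2 quantifies.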


3 Allocation Properties 


We now turn our attention to designing a max-min fair al- 
location policy for multiple resources and heterogeneous 
requests. To illustrate the problem, consider a system 
consisting of 9 CPUs and 18 GB RAM, and two users: 
user A runs tasks that require (1 CPU, 4 GB) each, and 
user B runs tasks that require (3 CPUs, 1 GB) each. 
What constitutes a fair allocation policy for this case? 

Figure 2: CDF of demand to slot ratio in a 2000-node cluster at 
Facebook over a one month period (October 2010). A demand 
to slot ratio of 2.0 represents a task that requires twice as much 
CPU (or memory) as the slot CPU (or memory) size. 


One possibility would be to allocate each user half of 
every resource. Another possibility would be to equal- 
ize the aggregate (i.e., CPU plus memory) allocations of 
each user. While it is relatively easy to come up with a 
variety of possible “fair” allocations, it is unclear how to 
evaluate and compare these allocations. 
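
To make the first candidate concrete, the sketch below counts how many whole tasks each user could run if every resource were split in half. The helper name is illustrative; the capacities and demand vectors are those of the example above.

```python
def tasks_that_fit(resources, demand) -> int:
    """Largest whole number of tasks with per-task `demand` that fit in `resources`."""
    return min(int(r // d) for r, d in zip(resources, demand))


total = (9, 18)                        # (CPUs, GB of RAM)
half = tuple(r / 2 for r in total)     # each user gets (4.5 CPUs, 9 GB)

tasks_a = tasks_that_fit(half, (1, 4))  # user A: memory-limited
tasks_b = tasks_that_fit(half, (3, 1))  # user B: CPU-limited
print(tasks_a, tasks_b)
```

Under an equal split, user A is limited by memory and user B by CPU, so part of each user's half of the other resource sits idle.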

To address this challenge, we start with a set of de- 
sirable properties that we believe any resource alloca- 
tion policy for multiple resources and heterogeneous de- 
mands should satisfy. We then let these properties guide 
the development of a fair allocation policy. We have 
found the following four properties to be important: 


1. Sharing incentive: Each user should be better off 
sharing the cluster than exclusively using her own 
partition of the cluster. Consider a cluster with iden- 
tical nodes and n users. Then a user should not be 
able to allocate more tasks in a cluster partition con- 
sisting of 1/n of all resources. 


2. Strategy-proofness: Users should not be able to 
benefit by lying about their resource demands. This 
provides incentive compatibility, as a user cannot 
improve her allocation by lying. 


3. Envy-freeness: A user should not prefer the allo- 
cation of another user. This property embodies the 
notion of fairness [13, 30]. 


4. Pareto efficiency: It should not be possible to in- 
crease the allocation of a user without decreasing 
the allocation of at least another user. This prop- 
erty is important as it leads to maximizing system 
utilization subject to satisfying the other properties. 
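
Two of these properties are easy to check mechanically for a given allocation. The sketch below tests envy-freeness and the sharing incentive, assuming each user's utility is the number of (possibly fractional) tasks her allocation supports. The allocation tested is the one Section 4.1 derives for the running example; the function names are hypothetical.

```python
def tasks_supported(allocation, demand) -> float:
    """Number of (possibly fractional) tasks an allocation can run."""
    return min(a / d for a, d in zip(allocation, demand))


def is_envy_free(allocs: dict, demands: dict) -> bool:
    """No user could run more tasks with another user's allocation."""
    return all(
        tasks_supported(allocs[u], demands[u]) >= tasks_supported(allocs[v], demands[u])
        for u in allocs for v in allocs if u != v
    )


def has_sharing_incentive(allocs: dict, demands: dict, total) -> bool:
    """Every user does at least as well as with a static 1/n partition."""
    equal_split = tuple(r / len(allocs) for r in total)
    return all(
        tasks_supported(allocs[u], demands[u]) >= tasks_supported(equal_split, demands[u])
        for u in allocs
    )


demands = {"A": (1, 4), "B": (3, 1)}     # per-task (CPUs, GB)
allocs = {"A": (3, 12), "B": (6, 2)}     # allocation from Section 4.1
print(is_envy_free(allocs, demands), has_sharing_incentive(allocs, demands, (9, 18)))
```

Strategy-proofness and Pareto efficiency quantify over all possible misreports or alternative allocations, so they are not checked by this sketch.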


We briefly comment on the strategy-proofness and 
sharing incentive properties, which we believe are of 
special importance in datacenter environments. Anec- 
dotal evidence from cloud operators that we have talked 


with indicates that strategy-proofness is important, as it 
is common for users to attempt to manipulate schedulers. 
For example, one of Yahoo!’s Hadoop MapReduce dat- 
acenters has different numbers of slots for map and re- 
duce tasks. A user discovered that the map slots were 
contended, and therefore launched all his jobs as long 
reduce phases, which would manually do the work that 
MapReduce does in its map phase. Another big search 
company provided dedicated machines for jobs only if 
the users could guarantee high utilization. The company 
soon found that users would sprinkle their code with in- 
finite loops to artificially inflate utilization levels. 

Furthermore, any policy that satisfies the sharing in- 
centive property also provides performance isolation, as 
it guarantees a minimum allocation to each user (i.e., a 
user cannot do worse than owning 1/n of the cluster) irre- 
spective of the demands of the other users. 

It can be easily shown that in the case of a single re- 
source, max-min fairness satisfies all the above proper- 
ties. However, achieving these properties in the case 
of multiple resources and heterogeneous user demands 
is not trivial. For example, the preferred fair division 
mechanism in microeconomic theory, Competitive Equi- 
librium from Equal Incomes [22, 30, 33], is not strategy- 
proof (see Section 6.1.2). 

In addition to the above properties, we consider four 
other nice-to-have properties: 


• Single resource fairness: For a single resource, the 
solution should reduce to max-min fairness. 


• Bottleneck fairness: If there is one resource that is 
percent-wise demanded most by every user, then 
the solution should reduce to max-min fairness for 
that resource. 


• Population monotonicity: When a user leaves the 
system and relinquishes her resources, none of the 
allocations of the remaining users should decrease. 


• Resource monotonicity: If more resources are added 
to the system, none of the allocations of the existing 
users should decrease. 


4 Dominant Resource Fairness (DRF) 


We propose Dominant Resource Fairness (DRF), a new 
allocation policy for multiple resources that meets all 
four of the required properties in the previous section. 
For every user, DRF computes the share of each resource 
allocated to that user. The maximum among all shares 
of a user is called that user’s dominant share, and the 
resource corresponding to the dominant share is called 
the dominant resource. Different users may have dif- 
ferent dominant resources. For example, the dominant 
resource of a user running a computation-bound job is 
Figure 3: DRF allocation for the example in Section 4.1. 


CPU, while the dominant resource of a user running an 
I/O-bound job is bandwidth.¹ DRF simply applies max- 
min fairness across users’ dominant shares. That is, DRF 
seeks to maximize the smallest dominant share in the 
system, then the second-smallest, and so on. 
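
The definitions of dominant share and dominant resource translate directly into code. The sketch below uses exact fractions and the capacities from the running example (9 CPUs, 18 GB); the function name and the tie-breaking rule (first resource wins) are assumptions made for illustration.

```python
from fractions import Fraction


def dominant_share(allocation, capacity, names):
    """Return (dominant resource name, dominant share) for one user."""
    shares = [Fraction(a, c) for a, c in zip(allocation, capacity)]
    best = max(range(len(shares)), key=shares.__getitem__)  # first maximum on ties
    return names[best], shares[best]


capacity = (9, 18)
names = ("CPU", "memory")

one_task_a = dominant_share((1, 4), capacity, names)  # one task of user A
one_task_b = dominant_share((3, 1), capacity, names)  # one task of user B
print(one_task_a, one_task_b)
```

A user whose shares tie across resources has more than one dominant resource; this sketch simply reports the first.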

We start by illustrating DRF with an example (§4.1), 
then present an algorithm for DRF (§4.2) and a defini- 
tion of weighted DRF (§4.3). In Section 5, we present 
two other allocation policies: asset fairness, a straightfor- 
ward policy that aims to equalize the aggregate resources 
allocated to each user, and competitive equilibrium from 
equal incomes (CEEI), a popular fair allocation policy 
preferred in the micro-economic domain [22, 30, 33]. 

In this section, we consider a computation model with 
n users and m resources. Each user runs individual tasks, 
and each task is characterized by a demand vector, which 
specifies the amount of resources required by the task, 
e.g., (1 CPU, 4 GB). In general, tasks (even the ones 
belonging to the same user) may have different demands. 


4.1 An Example 


Consider a system with 9 CPUs, 18 GB RAM, and two 
users, where user A runs tasks with demand vector (1 
CPU, 4 GB), and user B runs tasks with demand vector 
(3 CPUs, 1 GB) each. 

In the above scenario, each task from user A consumes 
1/9 of the total CPUs and 2/9 of the total memory, so 
user A’s dominant resource is memory. Each task from 
user B consumes 1/3 of the total CPUs and 1/18 of the 
total memory, so user B’s dominant resource is CPU. 
DRF will equalize users’ dominant shares, giving the al- 
location in Figure 3: three tasks for user A, with a total 
of (3 CPUs, 12 GB), and two tasks for user B, with a 
total of (6 CPUs, 2 GB). With this allocation, each user 
ends up with the same dominant share, i.e., user A gets 
2/3 of RAM, while user B gets 2/3 of the CPUs. 

This allocation can be computed mathematically as 
follows. Let x and y be the number of tasks allocated 


"A user may have the same share on multiple resources, and might 
therefore have multiple dominant resources. 
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Algorithm 1 DRF pseudo-code 


SAP. Py) > total resource capacities 
C = (c1,°++,Cm) consumed resources, initially 0 
s; (j=1..n) | BP user 2’s dominant shares, initially 0 
Uj; = (ii,**+;Uim) ( = 1..n) > resources given to 


user 7, initially O 


pick user 7 with lowest dominant share s;, 
D,; < demand of user 7’s next task 
ifC + D; < Rthen 


C=C+D,; > update consumed vector 
U; =U; + D; > update 2’s allocation vector 
54 = max?" 1 {Ui,j /t3 
else 
return > the cluster is full 
end if 


by DRF to users A and B, respectively. Then user A
receives (x CPU, 4x GB), while user B gets (3y CPU,
y GB). The total amount of resources allocated to both
users is (x + 3y) CPUs and (4x + y) GB. Also, the dominant
shares of users A and B are 4x/18 = 2x/9 and
3y/9 = y/3, respectively (their corresponding shares of
memory and CPU). The DRF allocation is then given by
the solution to the following optimization problem:


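\[
\begin{aligned}
\max\ (x, y) \quad & \text{(Maximize allocations)} \\
\text{subject to} \quad & \\
x + 3y \le 9 \quad & \text{(CPU constraint)} \\
4x + y \le 18 \quad & \text{(Memory constraint)} \\
\frac{2x}{9} = \frac{y}{3} \quad & \text{(Equalize dominant shares)}
\end{aligned}
\]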


Solving this problem yields² x = 3 and y = 2. Thus,
user A gets (3 CPU, 12 GB) and B gets (6 CPU, 2 GB). 
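
As a worked check of the solution above (our own derivation from the stated constraints, not an additional result), substitute the equalization constraint into the two capacity constraints:

```latex
\begin{align*}
\frac{2x}{9} = \frac{y}{3} \;&\Rightarrow\; y = \frac{2x}{3} \\
\text{CPU: } x + 3y = x + 2x = 3x \le 9 \;&\Rightarrow\; x \le 3 \\
\text{Memory: } 4x + y = 4x + \frac{2x}{3} = \frac{14x}{3} \le 18 \;&\Rightarrow\; x \le \frac{27}{7} \approx 3.86 \\
x = \min\!\left(3, \tfrac{27}{7}\right) = 3, \quad & y = \frac{2 \cdot 3}{3} = 2
\end{align*}
```

The CPU constraint binds first, so CPU is the resource that becomes saturated in this example, while 4 GB of memory remain unallocated.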

Note that DRF need not always equalize users’ dominant
shares. When a user’s total demand is met, that user
will not need more tasks, so the excess resources will 
be split among the other users, much like in max-min 
fairness. In addition, if a resource gets exhausted, users 
that do not need that resource can still continue receiv- 
ing higher shares of the other resources. We present an 
algorithm for DRF allocation in the next section. 


4.2 DRF Scheduling Algorithm 


Algorithm 1 shows pseudo-code for DRF scheduling. 
The algorithm tracks the total resources allocated to each 
user as well as the user’s dominant share, $s_i$. At each
step, DRF picks the user with the lowest dominant share 
among those with tasks ready to run. If that user’s task 
demand can be satisfied, i.e., there are enough resources 


²Note that given the last constraint (i.e., 2x/9 = y/3), allocations x
and y are simultaneously maximized.

Table 1: Example of DRF allocating resources in a system with 9 CPUs and 18 GB RAM to two users running tasks that require
(1 CPU, 4 GB) and (3 CPUs, 1 GB), respectively. Each row corresponds to DRF making a scheduling decision. A row shows the 
shares of each user for each resource, the user’s dominant share, and the fraction of each resource allocated so far. DRF repeatedly 
selects the user with the lowest dominant share (indicated in bold) to launch a task, until no more tasks can be allocated. 


available in the system, one of her tasks is launched. We 
consider the general case in which a user can have tasks 
with different demand vectors, and we use variable $D_i$ to
denote the demand vector of the next task user i wants
to launch. For simplicity, the pseudo-code does not cap- 
ture the event of a task finishing. In this case, the user 
releases the task’s resources and DRF again selects the 
user with the smallest dominant share to run her task. 

Consider the two-user example in Section 4.1. Table 1 
illustrates the DRF allocation process for this example. 
DRF first picks B to run a task. As a result, the shares 
of B become (3/9, 1/18), and the dominant share be- 
comes max(3/9,1/18) = 1/3. Next, DRF picks A, as 
her dominant share is 0. The process continues until it 
is no longer possible to run new tasks. In this case, this 
happens as soon as CPU has been saturated. 

At the end of the above allocation, user A gets (3 CPU, 
12 GB), while user B gets (6 CPU, 2 GB), i.e., each user 
gets 2/3 of its dominant resource. 

Note that in this example the allocation stops as soon 
as any resource is saturated. However, in the general 
case, it may be possible to continue to allocate tasks even 
after some resource has been saturated, as some tasks 
might not have any demand on the saturated resource. 

The above algorithm can be implemented using a bi- 
nary heap that stores each user’s dominant share. Each 
scheduling decision then takes O(log n) time for n users.
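
The scheduling loop described above can be sketched in Python with a binary heap. This is a minimal illustration, not the Mesos implementation: the function name `drf_allocate`, the tie-breaking by user name, and the assumption that each user has an unbounded supply of identical tasks with integer demands are all ours.

```python
import heapq
from fractions import Fraction

def drf_allocate(capacity, demands):
    """Run the DRF loop until the selected user's next task no longer fits.

    capacity: list of total integer amounts per resource.
    demands:  dict mapping user name -> integer per-task demand vector.
    Returns a dict mapping user name -> number of tasks launched.
    """
    consumed = [0] * len(capacity)
    tasks = {user: 0 for user in demands}
    # Heap of (dominant share, user); ties are broken by user name.
    heap = [(Fraction(0), user) for user in sorted(demands)]
    heapq.heapify(heap)
    while heap:
        share, user = heapq.heappop(heap)
        demand = demands[user]
        if any(c + d > r for c, d, r in zip(consumed, demand, capacity)):
            break  # as in Algorithm 1: stop once the chosen task does not fit
        consumed = [c + d for c, d in zip(consumed, demand)]
        tasks[user] += 1
        # Dominant share = largest fraction of any resource held by the user.
        share = max(Fraction(tasks[user] * d, r) for d, r in zip(demand, capacity))
        heapq.heappush(heap, (share, user))
    return tasks

if __name__ == "__main__":
    # Example from Section 4.1: 9 CPUs, 18 GB; A needs (1, 4), B needs (3, 1).
    print(drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]}))
```

Exact fractions are used so that equal dominant shares compare as equal. On the Section 4.1 example the sketch launches three tasks for A and two for B; the order of the first two picks depends on the tie-breaking rule, but the final allocation does not.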


4.3 Weighted DRF


In practice, there are many cases in which allocating re- 
sources equally across users is not the desirable policy. 
Instead, we may want to allocate more resources to users 
running more important jobs, or to users that have con- 
tributed more resources to the cluster. To achieve this 
goal, we propose Weighted DRF, a generalization of both 
DRF and weighted max-min fairness. 

With Weighted DRF, each user i is associated with a weight
vector $W_i = \langle w_{i,1}, \dots, w_{i,m} \rangle$, where $w_{i,j}$ represents the
weight of user i for resource j. The definition of a dominant
share for user i changes to $s_i = \max_j \{u_{i,j}/w_{i,j}\}$,
where $u_{i,j}$ is user i’s share of resource j. A particular
case of interest is when all the weights of user i are equal,
i.e., $w_{i,j} = w_i$ $(1 \le j \le m)$. In this case, the ratio between
the dominant shares of users i and j will be simply
$w_i/w_j$. If the weights of all users are set to 1, Weighted
DRF reduces trivially to DRF.
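
The weighted dominant share can be computed directly from its definition. The helper below is a small sketch; the name `weighted_dominant_share` and the use of exact fractions are our choices for illustration.

```python
from fractions import Fraction

def weighted_dominant_share(shares, weights):
    """Return max_j u_ij / w_ij for one user.

    shares:  the user's share of each resource (fractions of the total).
    weights: the user's weight for each resource (all positive).
    """
    return max(Fraction(u) / Fraction(w) for u, w in zip(shares, weights))

# User A of Section 4.1 after three tasks holds 3/9 of the CPUs and 12/18 of the RAM.
shares_a = [Fraction(3, 9), Fraction(12, 18)]
print(weighted_dominant_share(shares_a, [1, 1]))  # unit weights: plain DRF
print(weighted_dominant_share(shares_a, [2, 2]))  # weight 2 on every resource
```

With unit weights the result is the ordinary dominant share (2/3 here). Doubling every weight halves the value the scheduler compares, so such a user keeps being selected until her unweighted share is twice that of a weight-1 user.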


5 Alternative Fair Allocation Policies 


Defining a fair allocation in a multi-resource system is 
not an easy question, as the notion of “fairness” is itself 
open to discussion. In our efforts, we considered numerous
allocation policies before settling on DRF as the only
one that satisfies all four of the required properties in 
Section 3: sharing incentive, strategy-proofness, Pareto 
efficiency, and envy-freeness. In this section, we con- 
sider two of the alternatives we have investigated: Asset 
Fairness, a simple and intuitive policy that aims to equal- 
ize the aggregate resources allocated to each user, and 
Competitive Equilibrium from Equal Incomes (CEEI),
the policy of choice for fairly allocating resources in the 
microeconomic domain [22, 30, 33]. We compare these 
policies with DRF in Section 5.3. 


5.1 Asset Fairness 


The idea behind Asset Fairness is that equal shares of 
different resources are worth the same, i.e., that 1% of 
all CPUs is worth the same as 1% of memory and 1%
of I/O bandwidth. Asset Fairness then tries to equalize
the aggregate resource value allocated to each user. In
particular, Asset Fairness computes for each user i the
aggregate share $x_i = \sum_j s_{i,j}$, where $s_{i,j}$ is the share of
resource j given to user i. It then applies max-min across
users’ aggregate shares, i.e., it repeatedly launches tasks 
for the user with the minimum aggregate share. 
Consider the example in Section 4.1. Since there are 
twice as many GB of RAM as CPUs (i.e., 9 CPUs and 
18 GB RAM), one CPU is worth twice as much as one 
GB of RAM. Supposing that one GB is worth $1 and 
one CPU is worth $2, it follows that user A spends $6 
for each task, while user B spends $7. Let x and y be 
the number of tasks allocated by Asset Fairness to users 
A and B, respectively. Then the asset-fair allocation is
given by the solution to the following optimization problem:

\[
\begin{aligned}
\max\ (x, y) \quad & \text{(Maximize allocations)} \\
\text{subject to} \quad & \\
x + 3y \le 9 \quad & \text{(CPU constraint)} \\
4x + y \le 18 \quad & \text{(Memory constraint)} \\
6x = 7y \quad & \text{(Every user spends the same)}
\end{aligned}
\]


Solving the above problem yields x = 2.52 and y =
2.16. Thus, user A gets (2.5 CPUs, 10.1 GB), while user
B gets (6.5 CPUs, 2.2 GB).
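
The asset-fair values follow from the same substitution technique used for DRF. The derivation below is our own worked check of the numbers above:

```latex
\begin{align*}
6x = 7y \;&\Rightarrow\; y = \frac{6x}{7} \\
\text{CPU: } x + 3y = x + \frac{18x}{7} = \frac{25x}{7} \le 9 \;&\Rightarrow\; x \le \frac{63}{25} = 2.52 \\
\text{Memory: } 4x + y = 4x + \frac{6x}{7} = \frac{34x}{7} \le 18 \;&\Rightarrow\; x \le \frac{63}{17} \approx 3.71 \\
x = 2.52, \quad & y = \frac{6 \cdot 2.52}{7} = 2.16
\end{align*}
```

Again CPU is the binding resource: user A holds 2.52 CPUs and 10.08 GB, and user B holds 6.48 CPUs and 2.16 GB.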

While this allocation policy seems compelling in its 
simplicity, it has a significant drawback: it violates the 
sharing incentive property. As we show in Section 6.1.1, 
asset fairness can result in one user getting less than 1/n 
of all resources, where n is the total number of users. 


5.2 Competitive Equilibrium from Equal Incomes 


In microeconomic theory, the preferred method to fairly 
divide resources is Competitive Equilibrium from Equal 
Incomes (CEEI) [22, 30, 33]. With CEEI, each user initially
receives 1/n of every resource, and subsequently,
each user trades her resources with other users in a perfectly
competitive market.³ The outcome of CEEI is both
envy-free and Pareto efficient [30]. 

More precisely, the CEEI allocation is given by the 
Nash bargaining solution⁴ [22, 23]. The Nash bargaining
solution picks the feasible allocation that maximizes
$\prod_i u_i(a_i)$, where $u_i(a_i)$ is the utility that user i gets from
her allocation $a_i$. To simplify the comparison, we assume
that the utility that a user gets from her allocation is simply
her dominant share, $s_i$.

Consider again the two-user example in Section 4.1. 
Recall that the dominant share of user A is 4x/18 =
2x/9 while the dominant share of user B is 3y/9 = y/3,
where x is the number of tasks given to A and y is the
number of tasks given to B. Maximizing the product
of the dominant shares is equivalent to maximizing the 
product x - y. Thus, CEEI aims to solve the following 
optimization problem: 


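\[
\begin{aligned}
\max\ (x \cdot y) \quad & \text{(maximize Nash product)} \\
\text{subject to} \quad & \\
x + 3y \le 9 \quad & \text{(CPU constraint)} \\
4x + y \le 18 \quad & \text{(Memory constraint)}
\end{aligned}
\]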


Solving the above problem yields x = 45/11 and y = 
18/11. Thus, user A gets (4.1 CPUs, 16.4 GB), while 
user B gets (4.9 CPUs, 1.6 GB). 
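
For two users, the Nash product can be maximized by enumerating a handful of candidate points. The sketch below is an illustration under our own assumptions (two users, positive integer coefficients, divisible tasks); `max_nash_product` is a hypothetical helper, not part of any CEEI library.

```python
from fractions import Fraction
from itertools import combinations

def max_nash_product(constraints):
    """Maximize x*y subject to a*x + b*y <= c for each (a, b, c), x, y >= 0.

    Assumes all a, b, c are positive integers, so the optimum lies on some
    constraint line: either at the point maximizing x*y along that line, or
    at a vertex where two constraint lines cross.
    """
    def feasible(x, y):
        return x >= 0 and y >= 0 and all(a * x + b * y <= c for a, b, c in constraints)

    candidates = []
    # Along a*x + b*y = c the product x*y peaks at x = c/(2a), y = c/(2b).
    for a, b, c in constraints:
        candidates.append((Fraction(c, 2 * a), Fraction(c, 2 * b)))
    # Vertices: intersections of pairs of constraint lines (Cramer's rule).
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det != 0:
            candidates.append((Fraction(c1 * b2 - c2 * b1, det),
                               Fraction(a1 * c2 - a2 * c1, det)))
    return max((p for p in candidates if feasible(*p)), key=lambda p: p[0] * p[1])

if __name__ == "__main__":
    # Section 4.1 example: x + 3y <= 9 (CPU), 4x + y <= 18 (memory).
    print(max_nash_product([(1, 3, 9), (4, 1, 18)]))
```

For the Section 4.1 example neither single-constraint peak is feasible, so the maximum is the vertex where both constraints are tight, x = 45/11 and y = 18/11.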


³A perfect market satisfies the price-taking (i.e., no single user affects
prices) and market-clearance (i.e., matching supply and demand
via price adjustment) assumptions.

⁴For this to hold, utilities have to be homogeneous, i.e., u(αx) =
αu(x) for α > 0, which is true in our case.

Figure 4: Allocations given by DRF, Asset Fairness and CEEI
in the example scenario in Section 4.1.


Unfortunately, while CEEI is envy-free and Pareto ef- 
ficient, it turns out that it is not strategy-proof, as we will 
show in Section 6.1.2. Thus, users can increase their al- 
locations by lying about their resource demands. 


5.3 Comparison with DRF


To give the reader an intuitive understanding of Asset 
Fairness and CEEI, we compare their allocations for the 
example in Section 4.1 to that of DRF in Figure 4. 

We see that DRF equalizes the dominant shares of the 
users, i.e., user A’s memory share and user B’s CPU 
share. In contrast, Asset Fairness equalizes the total frac- 
tion of resources allocated to each user, i.e., the areas of 
the rectangles for each user in the figure. Finally, be- 
cause CEEI assumes a perfectly competitive market, it 
finds a solution satisfying market clearance, where ev- 
ery resource has been allocated. Unfortunately, this ex- 
act property makes it possible to cheat CEEI: a user can 
claim she needs more of some underutilized resource 
even when she does not, leading CEEI to give more tasks 
overall to this user to achieve market clearance. 


6 Analysis 


In this section, we discuss which of the properties pre- 
sented in Section 3 are satisfied by Asset Fairness, CEEI, 
and DRF. We also evaluate the accuracy of DRF when 
task sizes do not match the available resources exactly. 


6.1 Fairness Properties 


Table 2 summarizes the fairness properties that are sat- 
isfied by Asset Fairness, CEEI, and DRF. The Appendix 
contains the proofs of the main properties of DRF, while 
our technical report [14] contains a more complete list of 
results for DRF and CEEI. In the remainder of this sec- 
tion, we discuss some of the interesting missing entries 
in the table, i.e., properties violated by each of these dis- 
ciplines. In particular, we show through examples why 
Asset Fairness and CEEI lack the properties that they
do, and we prove that no policy can provide resource
monotonicity without violating either sharing incentive
or Pareto efficiency to explain why DRF lacks resource
monotonicity.

Table 2: Properties of Asset Fairness, CEEI and DRF.


6.1.1 Properties Violated by Asset Fairness 


While being the simplest policy, Asset Fairness violates 
several important properties: sharing incentive, bottle- 
neck fairness, and resource monotonicity. Next, we use 
examples to show the violation of these properties. 


Theorem 1 Asset Fairness violates the sharing incen- 
tive property. 


Proof Consider the following example, illustrated in
Figure 5: two users in a system with (30, 30) total resources
have demand vectors $D_1 = (1, 3)$ and $D_2 =
(1, 1)$. Asset fairness will allocate the first user 6 tasks
and the second user 12 tasks. The first user will receive
(6, 18) resources, while the second will use (12, 12).
While each user gets an equal aggregate share of 24/30, the
second user gets less than half (15) of both resources.
This violates the sharing incentive property, as the second
user would be better off statically partitioning the
cluster and owning half of the nodes. □


Theorem 2 Asset Fairness violates the bottleneck fair- 
ness property. 


Proof Consider a scenario with a total resource vector of
(21, 21) and two users with demand vectors $D_1 = (3, 2)$
and $D_2 = (4, 1)$, making resource 1 the bottleneck resource.
Asset fairness will give each user 3 tasks, equalizing
their aggregate usage to 15. However, this only
gives the first user 3/7 of resource 1 (the contended bottleneck
resource), violating bottleneck fairness. □


Theorem 3 Asset fairness does not satisfy resource 
monotonicity.

Figure 5: Example showing that Asset Fairness can fail to meet
the sharing incentive property. Asset Fairness gives user 2 less
than half of both resources.

Figure 6: Example showing how CEEI violates strategy-proofness.
User 1 can increase her share by claiming that she needs
more of resource 2 than she actually does.


Proof Consider two users A and B with demands (4, 2)
and (1, 1) and 77 units of two resources. Asset fairness
allocates A a total of (44, 22) and B (33, 33), equalizing
their sum of shares to 66/77. If resource two is doubled, both
users’ share of the second resource is halved, while the
first resource is saturated. Asset fairness now decreases
A’s allocation to (42, 21) and increases B’s to (35, 35),
equalizing their shares to 42/77 + 21/154 = 35/77 + 35/154 = 105/154.
Thus resource monotonicity is violated.
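
The numbers in the Asset Fairness proofs can be reproduced with a few lines of Python. This is a sketch for the two-user case only, assuming divisible tasks, integer inputs, and unbounded demand; `asset_fair` is a name introduced here for illustration.

```python
from fractions import Fraction

def asset_fair(capacity, demand_a, demand_b):
    """Two-user Asset Fairness with divisible tasks.

    Equalizes the users' aggregate shares and scales both task counts up
    until some resource is exhausted. Returns (tasks_a, tasks_b).
    """
    # Aggregate share consumed by one task of each user.
    per_task_a = sum(Fraction(d, r) for d, r in zip(demand_a, capacity))
    per_task_b = sum(Fraction(d, r) for d, r in zip(demand_b, capacity))
    # Equal aggregate shares: x * per_task_a == y * per_task_b, so y = ratio * x.
    ratio = per_task_a / per_task_b
    # Largest x such that every resource constraint still holds.
    x = min(Fraction(r) / (da + ratio * db)
            for r, da, db in zip(capacity, demand_a, demand_b))
    return x, ratio * x

if __name__ == "__main__":
    before = asset_fair([77, 77], [4, 2], [1, 1])   # Theorem 3, original system
    after = asset_fair([77, 154], [4, 2], [1, 1])   # resource 2 doubled
    print(before, after)
```

Before the change A runs 11 tasks, i.e., (44, 22); after resource 2 is doubled A runs only 10.5 tasks, i.e., (42, 21), even though no resource shrank. The same helper reproduces the 6 and 12 tasks of Theorem 1 and the 3 tasks each of Theorem 2.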


6.1.2 Properties Violated by CEEI 


While CEEI is envy-free and Pareto efficient, it turns 
out that it is not strategy-proof. Intuitively, this is because
CEEI assumes a perfectly competitive market that
achieves market clearance, i.e., matching of supply and 
demand and allocation of all the available resources. 
This can lead to CEEI giving much higher shares to users 
that use more of a less-contended resource in order to 
fully utilize that resource. Thus, a user can claim that she 
needs more of some underutilized resource to increase 
her overall share of resources. We illustrate this below. 


Theorem 4 CEEI is not strategy-proof.

Figure 7: Example showing that CEEI violates population
monotonicity. When user 3 leaves, CEEI changes the allocation
from a) to b), lowering the share of user 2. [Panels: a) With
3 users; b) After user 3 leaves. Bars show Res. 1 and Res. 2.]


Proof Consider the following example, shown in Figure
6. Assume a total resource vector of (100, 100), and two
users with demands (16, 1) and (1, 2). In this case, CEEI
allocates 100/31 and 1500/31 tasks to each user respectively
(approximately 3.2 and 48.4 tasks). If user 1 changes her
demand vector to (16, 8), asking for more of resource
2 than she actually needs, CEEI gives the users 25/6
and 100/3 tasks respectively (approximately 4.2 and 33.3
tasks). Thus, user 1 improves her number of tasks from
3.2 to 4.2 by lying about her demand vector. User 2 suffers
because of this, as her task allocation decreases. □
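
The task counts in this proof can be checked by hand. The derivation below is our own and assumes, in line with market clearance, that both resource constraints are tight at the CEEI optimum, with x and y the task counts of users 1 and 2:

```latex
\begin{align*}
\text{Truthful: } & 16x + y = 100,\; x + 2y = 100
  \;\Rightarrow\; 31x = 100
  \;\Rightarrow\; x = \tfrac{100}{31} \approx 3.2,\; y = \tfrac{1500}{31} \approx 48.4 \\
\text{Misreported: } & 16x + y = 100,\; 8x + 2y = 100
  \;\Rightarrow\; 24x = 100
  \;\Rightarrow\; x = \tfrac{25}{6} \approx 4.2,\; y = \tfrac{100}{3} \approx 33.3
\end{align*}
```

In both cases the point that maximizes the product along either single constraint is infeasible for the other, so the optimum is the intersection shown.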


In addition, for the same intuitive reason (market 
clearance), we have the following result: 


Theorem 5 CEEI violates population monotonicity. 


Proof Consider the total resource vector (100, 100) and
three users with the following demand vectors: $D_1 =
(4, 1)$, $D_2 = (1, 16)$, and $D_3 = (16, 1)$ (see Figure 7).
CEEI will yield the allocation $A_1 = (11.3, 5.4, 3.1)$,
where the numbers in parentheses represent the number
of tasks allocated to each user. If user 3 leaves the system
and relinquishes her resources, CEEI gives the new allocation
$A_2 = (23.8, 4.8)$, which has made user 2 worse
off than in $A_1$. □


6.1.3 Resource Monotonicity vs. Sharing Incentives
and Pareto efficiency


As shown in Table 2, DRF achieves all the properties ex- 
cept resource monotonicity. Rather than being a limita- 
tion of DRF, this is a consequence of the fact that sharing 
incentive, Pareto efficiency, and resource monotonicity 
cannot be achieved simultaneously. Since we consider 
the first two of these properties to be more important (see 
Section 3) and since adding new resources to a system is 
a relatively rare event, we chose to satisfy sharing incentive
and Pareto efficiency, and give up resource monotonicity.
In particular, we have the following result.

Theorem 6 No allocation policy that satisfies the sharing
incentive and Pareto efficiency properties can also
satisfy resource monotonicity.


Proof We use a simple example to prove this prop- 
erty. Consider two users A and B with symmetric de- 
mands (2,1), and (1,2), respectively, and assume equal 
amounts of both resources. Sharing incentive requires 
that user A gets at least half of resource 1 and user B 
gets half of resource 2. By Pareto efficiency, we know 
that at least one of the two users must be allocated more 
resources. Without loss of generality, assume that user A 
is given more than half of resource 1 (a symmetric argu- 
ment holds if user B is given more than half of resource 
2). If the total amount of resource 2 is now increased by 
a factor of 4, user B is no longer getting its guaranteed 
share of half of resource 2. Now, the only feasible allocation
that satisfies the sharing incentive is to give both
users half of resource 1, which would require decreasing
user A’s share of resource 1, thus violating resource
monotonicity. □


This theorem explains why both DRF and CEEI vio- 
late resource monotonicity. 


6.2 Discrete Resource Allocation 


So far, we have implicitly assumed one big resource 
pool whose resources can be allocated in arbitrarily small 
amounts. Of course, this is often not the case in prac- 
tice. For example, clusters consist of many small ma- 
chines, where resources are allocated to tasks in discrete 
amounts. In the remainder of this section, we refer to
these two scenarios as the continuous, and the discrete 
scenario, respectively. We now turn our attention to how 
fairness is affected in the discrete scenario. 

Assume a cluster consisting of K machines. Let max-task
denote the maximum demand vector across all demand
vectors, i.e., max-task =
$\langle \max_i\{d_{i,1}\}, \max_i\{d_{i,2}\}, \dots, \max_i\{d_{i,m}\} \rangle$. Assume
further that any task can be scheduled on every machine,
i.e., the total amount of resources on each machine
is at least max-task. We only consider the case when
each user has strictly positive demands. Given these
assumptions, we have the following result.


Theorem 7 In the discrete scenario, it is possible to al- 
locate resources such that the difference between the al- 
locations of any two users is bounded by one max-task 
compared to the continuous allocation scenario. 


Proof Assume we start allocating resources on one machine
at a time, and that we always allocate a task to the
user with the lowest dominant share. As long as there
is at least a max-task available on the first machine, we
continue to allocate a task to the next user with least dominant
share. Once the available resources on the first machine
become less than a max-task size, we move to the
next machine and repeat the process. When the allocation
completes, the difference between any two users’ allocations
of their dominant resources compared to the continuous
scenario is at most max-task. If this were not the
case, then some user A would have more than max-task
discrepancy w.r.t. another user B. However, this cannot
be the case, because the last time A was allocated a
task, B should have been allocated a task instead. □

Figure 8: CPU, memory and dominant share for two jobs.
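
The machine-by-machine procedure used in the proof of Theorem 7 can be sketched as follows. This is an illustration only: `discrete_drf` is a hypothetical helper, and it assumes integer capacities, strictly positive integer demands, and an unbounded number of identical tasks per user.

```python
from fractions import Fraction

def discrete_drf(machines, demands):
    """Fill machines one at a time, always serving the lowest dominant share.

    machines: list of per-machine capacity vectors (integers).
    demands:  dict user -> per-task demand vector (strictly positive integers).
    Returns dict user -> number of tasks placed.
    """
    total = [sum(col) for col in zip(*machines)]
    # Component-wise maximum over all demand vectors.
    max_task = [max(col) for col in zip(*demands.values())]
    tasks = {u: 0 for u in demands}

    def dominant_share(u):
        return max(Fraction(tasks[u] * d, r) for d, r in zip(demands[u], total))

    for machine in machines:
        free = list(machine)
        # Keep placing tasks while a max-task still fits on this machine.
        while all(f >= m for f, m in zip(free, max_task)):
            user = min(sorted(demands), key=dominant_share)
            free = [f - d for f, d in zip(free, demands[user])]
            tasks[user] += 1
    return tasks

if __name__ == "__main__":
    # One machine with the Section 4.1 capacities and demands.
    print(discrete_drf([[9, 18]], {"A": [1, 4], "B": [3, 1]}))
```

On a single (9 CPU, 18 GB) machine the sketch stops once fewer than 3 CPUs are free and places two tasks per user, one fewer for A than in the continuous allocation, which is within the max-task bound of the theorem.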


7 Experimental Results 


This section evaluates DRF through micro- and macro- 
benchmarks. The former is done through experiments 
running an implementation of DRF in the Mesos cluster 
resource manager [16]. The latter is done using trace- 
driven simulations. 

We start by showing how DRF dynamically adjusts the 
shares of jobs with different resource demands in Section 
7.1. In Section 7.2, we compare DRF against slot-level 
fair sharing (as implemented by Hadoop Fair Scheduler 
[34] and Quincy [18]), and CPU-only fair sharing. Finally,
in Section 7.3, we use Facebook traces to compare
DRF and the Hadoop Fair Scheduler in terms of utilization
and job completion time.


7.1 Dynamic Resource Sharing


In our first experiment, we show how DRF dynamically 
shares resources between jobs with different demands. 
We ran two jobs on a 48-node Mesos cluster on Amazon 
EC2, using “extra large” instances with 4 CPU cores and 
15 GB of RAM. We configured Mesos to allocate up to 
4 CPUs and 14 GB of RAM on each node, leaving 1 GB
for the OS. We submitted two jobs that launched tasks 
with different resource demands at different times during 
a 6-minute interval. 

Figures 8 (a) and 8 (b) show the CPU and memory al- 
locations given to each job as a function of time, while 
Figure 8 (c) shows their dominant shares. In the first 2 
minutes, job 1 uses (1 CPU, 10 GB RAM) per task and
job 2 uses (1 CPU, 1 GB RAM) per task. Job 1’s dom- 
inant resource is RAM, while job 2’s dominant resource 
is CPU. Note that DRF equalizes the jobs’ shares of their 
dominant resources. In addition, because jobs have dif- 
ferent dominant resources, their dominant shares exceed 
50%, i.e., job 1 uses around 70% of the RAM while job 
2 uses around 75% of the CPUs. Thus, the jobs benefit 
from running in a shared cluster as opposed to taking half 
the nodes each. This captures the essence of the sharing 
incentive property. 

After 2 minutes, the task sizes of both jobs change, to 
(2 CPUs, 4 GB) for job 1 and (1 CPU, 3 GB) for job 
2. Now, both jobs’ dominant resource is CPU, so DRF 
equalizes their CPU shares. Note that DRF switches allo- 
cations dynamically by having Mesos offer resources to 
the job with the smallest dominant share as tasks finish. 

Finally, after 2 more minutes, the task sizes of both 
jobs change again: (1 CPU, 7 GB) for job 1 and (1 CPU, 
4 GB) for job 2. Both jobs’ dominant resource is now 
memory, so DRF tries to equalize their memory shares. 
The shares are not exactly equal because of resource
fragmentation (see Section 6.2).
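
The dominant resource of each job in the three phases follows from the cluster totals. The snippet below is a small check written for this discussion; `dominant_resource` and the variable names are ours, and the cluster size is taken from the setup above (48 nodes offering 4 CPUs and 14 GB each).

```python
def dominant_resource(demand, capacity, names=("CPU", "RAM")):
    """Return the name of the resource with the largest demand/capacity ratio."""
    ratios = [d / c for d, c in zip(demand, capacity)]
    return names[ratios.index(max(ratios))]

# 48 nodes, each offering 4 CPUs and 14 GB of RAM to Mesos.
cluster = (48 * 4, 48 * 14)
# Per-task demands (CPUs, GB) of job 1 and job 2 in each 2-minute phase.
phases = [((1, 10), (1, 1)), ((2, 4), (1, 3)), ((1, 7), (1, 4))]
for job1, job2 in phases:
    print(dominant_resource(job1, cluster), dominant_resource(job2, cluster))
```

The three lines printed are `RAM CPU`, `CPU CPU`, and `RAM RAM`, matching the dominant resources reported for each phase.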


7.2 DRF vs. Alternative Allocation Policies 


We next evaluate DRF with respect to two alternative 
schemes: slot-based fair scheduling (a common policy in 
current systems, such as the Hadoop Fair Scheduler [34] 
and Quincy [18]) and (max-min) fair sharing applied 
only to a single resource (CPU). For the experiment, we 
ran a 48-node Mesos cluster on EC2 instances with 8 
CPU cores and 7 GB RAM each. We configured Mesos 
to allocate 8 CPUs and 6 GB RAM on each node, leav- 
ing 1 GB free for the OS. We implemented these three 
scheduling policies as Mesos allocation modules. 

We ran a workload with two classes of users, representing
two organizational entities with different workloads.
One of the entities had four users submitting small
jobs with task demands (1 CPU, 0.5 GB). The other entity
had four users submitting large jobs with task demands
(2 CPUs, 2 GB). Each job consisted of 80 tasks.
As soon as a job finished, the user would launch another
job with similar demands. Each experiment ran for ten
minutes. At the end, we computed the number of completed
jobs of each type, as well as their response times.

Figure 9: Number of large jobs completed for each allocation
scheme in our comparison of DRF against slot-based fair sharing
and CPU-only fair sharing.

Figure 10: Number of small jobs completed for each allocation
scheme in our comparison of DRF against slot-based fair
sharing and CPU-only fair sharing.

For the slot-based allocation scheme, we varied the 
number of slots per machine from 3 to 6 to see how it 
affected performance. Figures 9 through 12 show our re- 
sults. In Figures 9 and 10, we compare the number of 
jobs of each type completed for each scheduling scheme 
in ten minutes. In Figures 11 and 12, we compare aver- 
age response times. 

Several trends are apparent from the data. First, with 
slot-based scheduling, both the throughput and job re- 
sponse times are worse than with DRF, regardless of the 
number of slots. This is because with a low slot count, 
the scheduler can undersubscribe nodes (e.g., launch
only 3 small tasks on a node), while with a large slot 
count, it can oversubscribe them (e.g., launch 4 large 
tasks on a node and cause swapping because each task 
needs 2 GB and the node only has 6 GB). Second, with 
fair sharing at the level of CPUs, the number of small 
jobs executed is similar to DRF, but there are far fewer
large jobs executed, because memory is overcommitted 
on some machines and leads to poor performance for all 
the high-memory tasks running there. Overall, the DRF- 
based scheduler that is aware of both resources has the 
lowest response times and highest overall throughput. 
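
The under- and oversubscription cases mentioned above reduce to simple arithmetic on the per-node capacities of this experiment (8 CPUs and 6 GB offered per node). The helper `slot_load` is introduced here purely for illustration.

```python
def slot_load(num_slots, task_demand, node_capacity):
    """Fraction of each node resource requested when every slot runs task_demand."""
    return [num_slots * d / c for d, c in zip(task_demand, node_capacity)]

node = (8, 6)  # CPUs and GB of RAM offered per node
print(slot_load(3, (1, 0.5), node))  # 3 slots filled with small tasks
print(slot_load(4, (2, 2), node))    # 4 slots filled with large tasks
```

With three slots of small tasks only 37.5% of the CPUs and 25% of the memory are requested, so the node is underused. With four large tasks the memory request is 8 GB against 6 GB available (a ratio above 1), which is the oversubscription that causes swapping.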


7.3 Simulations using Facebook Traces 


Next we use log traces from a 2000-node cluster at Facebook,
containing data for a one-week period (October
2010). The data consists of Hadoop MapReduce jobs.
We assume task duration, CPU usage, and memory consumption
are identical to those in the original trace. The traces
are simulated on a smaller cluster of 400 nodes to reach
higher utilization levels, such that fairness becomes relevant.
Each node in the cluster consists of 12 slots, 16
cores, and 32 GB memory. Figure 13 shows a short 300-second
sub-sample to visualize how CPU and memory
utilization looks for the same workload when using DRF
compared to Hadoop’s fair scheduler (slot). As shown in
the figure, DRF provides higher utilization, as it is able
to better match resource allocations with task demands.
Figure 14 shows the reduction of the average job completion
times for DRF as compared to the Hadoop fair
scheduler. The workload is quite heavy on small jobs,
which experience no improvements (i.e., -3%). This is
because small jobs typically consist of a single execution
phase, and the completion time is dominated by the
longest task. Thus completion time is hard to improve
for such small jobs. In contrast, the completion times of
the larger jobs reduce by as much as 66%. This is because
these jobs consist of many phases, and thus they
can benefit from the higher utilization achieved by DRF.

Figure 11: Average response time (in seconds) of large jobs
for each allocation scheme in our comparison of DRF against
slot-based fair sharing and CPU-only fair sharing.

Figure 12: Average response time (in seconds) of small jobs
for each allocation scheme in our comparison of DRF against
slot-based fair sharing and CPU-only fair sharing.


8 Related Work


We briefly review related work in computer science and 
economics. 

While many papers in computer science focus on 
multi-resource fairness, they consider only multiple
instances of the same interchangeable resource, e.g.,
CPU [6, 7, 35], and bandwidth [10, 20, 21]. Unlike these 
approaches, we focus on the allocation of resources of 
different types.

Figure 13: CPU and memory utilization for DRF and slot fairness
for a trace from a Facebook Hadoop cluster.

Figure 14: Average reduction of the completion times for different
job sizes for a trace from a Facebook Hadoop cluster.


Quincy [18] is a scheduler developed in the context 
of the Dryad cluster computing framework [17]. Quincy 
achieves fairness by modeling the fair scheduling prob- 
lem as a min-cost flow problem. Quincy does not cur- 
rently support multi-resource fairness. In fact, as men- 
tioned in the discussion section of the paper [18, pg. 17], 
it appears difficult to incorporate multi-resource require- 
ments into the min-cost flow formulation. 

Hadoop currently provides two fair sharing sched- 
ulers [1, 2, 34]. Both these schedulers allocate resources 
at the slot granularity, where a slot is a fixed fraction of 
the resources on a machine. As a result, these sched- 
ulers cannot always match the resource allocations with 
the tasks’ demands, especially when these demands are 
widely heterogeneous. As we have shown in Section 7, 
this mismatch may lead to either low cluster utilization 
or poor performance due to resource oversubscription. 

In the microeconomic literature, the problem of equity 
has been studied within and outside of the framework of 
game theory. The books by Young [33] and Moulin [22] 
are entirely dedicated to these topics and provide good 
introductions. The preferred method of fair division in 
microeconomics is CEEI [3, 33, 22], as introduced by 
Varian [30]. We have therefore devoted considerable attention
to it in Section 5.2. CEEI’s main drawback compared
to DRF is that it is not strategy-proof. As a result,
users can manipulate the scheduler by lying about their 
demands. 

Many of the fair division policies proposed in the mi- 
croeconomics literature are based on the notion of utility 
and, hence, focus on the single metric of utility. In the 
economics literature, max-min fairness is known as the 
lexicographic ordering [26, 25] (leximin) of utilities. 

The question is what the user utilities are in the multi- 
resource setting, and how to compare such utilities. One 
natural way is to define utility as the number of tasks allocated
to a user. But modeling utilities this way, together
with leximin, violates many of the fairness properties we 
proposed. Viewed in this light, DRF makes two contri- 
butions. First, it suggests using the dominant share as a 
proxy for utility, which is equalized using the standard 
leximin ordering. Second, we prove that this scheme is 
strategy-proof for such utility functions. Note that the 
leximin ordering is a lexicographic version of the Kalai- 
Smorodinsky (KS) solution [19]. Thus, our result shows 
that KS is strategy-proof for such utilities. 


9 Conclusion and Future Work 


We have introduced Dominant Resource Fairness (DRF), 
a fair sharing model that generalizes max-min fairness to 
multiple resource types. DRF allows cluster schedulers 
to take into account the heterogeneous demands of dat- 
acenter applications, leading to both fairer allocation of 
resources and higher utilization than existing solutions 
that allocate identical resource slices (slots) to all tasks. 
DRF satisfies a number of desirable properties. In particular,
DRF is strategy-proof, so that users are incentivized
to report their demands accurately. DRF also incentivizes
users to share resources by ensuring that users
perform at least as well in a shared cluster as they would 
in smaller, separate clusters. Other schedulers that we in- 
vestigated, as well as alternative notions of fairness from 
the microeconomic literature, fail to satisfy all of these 
properties. 

We have evaluated DRF by implementing it in the 
Mesos resource manager, and shown that it can lead to 
better overall performance than the slot-based fair sched- 
ulers that are commonly in use today. 


9.1 Future Work 


There are several interesting directions for future re- 
search. First, in cluster environments with discrete tasks, 
one interesting problem is to minimize resource frag- 
mentation without compromising fairness. This prob- 
lem is similar to bin-packing, but where one must pack 
as many items (tasks) as possible subject to meeting 
DRF. A second direction involves defining fairness when
tasks have placement constraints, such as machine pref- 
erences. Given the current trend of multi-core machines, 
a third interesting research direction is to explore the use
of DRF as an operating system scheduler. Finally, from 
a microeconomic perspective, a natural direction is to 
investigate whether DRF is the only possible strategy- 
proof policy for multi-resource fairness, given other de-
sirable properties such as Pareto efficiency.


A Appendix: DRF Properties


In this appendix, we present the main properties of DRF. 
The technical report [14] contains a more complete list 
of results for DRF and CEEI. For context, the following 
table summarizes the properties satisfied by Asset Fair- 
ness, CEEI, and DRF, respectively. 

In this section, we assume that all users have an un- 
bounded number of tasks. In addition, we assume that 
all tasks of a user have the same demand vector, and we 
will refer to this vector as the user’s demand vector. 

Next, we present progressive filling [9], a simple tech- 
nique to achieve DRF allocation when all resources are 
arbitrarily divisible. This technique is instrumental in
proving our results. 


A.1 Progressive Filling for DRF


Progressive filling is an idealized algorithm to achieve 
max-min fairness in a system in which resources can 
be allocated in arbitrarily small amounts [9, pg 450]. It
was originally used in a networking context, but we now 
adapt it to our problem domain. In the case of DRF, pro- 
gressive filling increases all users’ dominant shares at the 
same rate, while increasing their other resource alloca- 
tions proportionally to their task demand vectors, until at 
least one resource is saturated. At this point, the alloca- 
tions of all users using the saturated resource are frozen, 
and progressive filling continues recursively after elim- 
inating these users. In this case, progressive filling ter- 
minates when there are no longer users whose dominant 
shares can be increased. 

Progressive filling for DRF is equivalent to the
scheduling algorithm presented in Figure 1 after
appropriately scaling the users’ demand vectors. In particular,
each user’s demand vector is scaled such that allocating
resources to a user according to her scaled demand vector
will increase her dominant share by a fixed $\epsilon$, which
is the same for all users. Let $D_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,m})$
be the demand vector of user $i$, let $r_k$ be the capacity of
her dominant resource $k$ (see footnote 6), and let
$s_i = d_{i,k}/r_k$ be her dominant share. We then scale the
demand vector of user $i$ by $\epsilon/s_i$, i.e.,
$D'_i = \frac{\epsilon}{s_i} D_i = \frac{\epsilon}{s_i}(d_{i,1}, d_{i,2}, \ldots, d_{i,m})$.
Thus, every time a task of user $i$ is selected, she is allocated
an amount $\frac{\epsilon}{s_i} \cdot d_{i,k} = \epsilon \cdot r_k$ of the
dominant resource. This means that the share of the dominant
resource of user $i$ increases by $(\epsilon \cdot r_k)/r_k = \epsilon$,
as expected.

Footnote 6: Recall that in this section we assume that all tasks
of a user have the same demand vector.
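The progressive filling procedure described above can be sketched in a few lines of Python. This is only an illustration under the appendix's idealized assumptions (arbitrarily divisible resources, one fixed demand vector per user); the function name `drf_progressive_filling` and its input format are hypothetical and do not come from the paper or from Mesos.

```python
from fractions import Fraction


def drf_progressive_filling(capacities, demands):
    """Return each user's final dominant share under idealized progressive filling.

    capacities: total amount of each resource (r_1, ..., r_m).
    demands:    one per-task demand vector per user (hypothetical input format).
    Exact rational arithmetic is used so that saturation can be tested with ==.
    """
    m, n = len(capacities), len(demands)
    # Fraction of each resource consumed by one task of each user.
    frac = [[Fraction(d[k]) / Fraction(capacities[k]) for k in range(m)]
            for d in demands]
    per_task_dom = [max(row) for row in frac]  # s_i in the text
    if any(s == 0 for s in per_task_dom):
        raise ValueError("every user must demand some resource")
    # Resource fractions consumed per unit increase of the dominant share.
    rate = [[frac[i][k] / per_task_dom[i] for k in range(m)] for i in range(n)]

    share = [Fraction(0)] * n
    used = [Fraction(0)] * m
    active = set(range(n))
    while active:
        # Largest common increase before some resource saturates.
        step = None
        for k in range(m):
            load = sum(rate[i][k] for i in active)
            if load > 0:
                room = (1 - used[k]) / load
                step = room if step is None else min(step, room)
        if step is None:  # defensive guard; not reached for valid inputs
            break
        for k in range(m):
            used[k] += step * sum(rate[i][k] for i in active)
        for i in active:
            share[i] += step
        # Freeze every user that consumes a saturated resource.
        saturated = {k for k in range(m) if used[k] == 1}
        active = {i for i in active
                  if not any(rate[i][k] > 0 for k in saturated)}
    return share
```

For instance, with capacities (9, 18) and per-task demands (1, 4) and (3, 1), both users end with a dominant share of 2/3. Each loop iteration saturates at least one resource and freezes at least one user, so the loop runs at most n times.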


A.2 Allocation Properties 


We start with a preliminary result. 


Lemma 8 Every user in a DRF allocation has at least 
one saturated resource. 


Proof Assume this is not the case, i.e., none of the
resources used by user $i$ is saturated. However, this
contradicts the assumption that progressive filling has
completed the computation of the DRF allocation. Indeed,
as long as none of the resources of user $i$ are saturated,
progressive filling will continue to increase the
allocations of user $i$ (and of all the other users sharing only
non-saturated resources). □


Recall that progressive filling always allocates the
resources to a user proportionally to the user’s demand
vector. More precisely, let $D_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,m})$
be the demand vector of user $i$. Then, at any time $t$
during the progressive filling process, the allocation of user
$i$ is proportional to the demand vector,

$$A_i(t) = \alpha_i(t) \cdot D_i = \alpha_i(t) \cdot (d_{i,1}, d_{i,2}, \ldots, d_{i,m}) \qquad (1)$$

where $\alpha_i(t)$ is a positive scalar.
Now, we are in a position to prove the DRF properties.


Theorem 9 DRF is Pareto efficient. 


Proof Assume user $i$ can increase her dominant share,
$s_i$, without decreasing the dominant share of anyone else.
According to Lemma 8, user $i$ has at least one saturated
resource. If no other user is using the saturated resource,
then we are done as it would be impossible to increase $i$’s
share of the saturated resource. If other users are using
the saturated resource, then increasing the allocation of
$i$ would result in decreasing the allocation of at least
another user $j$ sharing the same saturated resource. Since
under progressive filling, the resources allocated to any
user are proportional to her demand vector (see Eq. 1),
decreasing the allocation of any resource used by user $j$
will also decrease $j$’s dominant share. This contradicts
our hypothesis, and therefore proves the result. □


Theorem 10 DRF satisfies the sharing incentive and 
bottleneck fairness properties. 


Proof Consider a system consisting of $n$ users. Assume
resource $k$ is the first one being saturated by using
progressive filling. Let $i$ be the user allocated the largest
share of resource $k$, and let $t_{i,k}$ denote her share of $k$.
Since resource $k$ is saturated, we have trivially $t_{i,k} \ge 1/n$.
Furthermore, by the definition of the dominant share, we
have $s_i \ge t_{i,k} \ge 1/n$. Since progressive filling increases
the allocation of each user’s dominant resource at the
same rate, it follows that each user gets at least $1/n$ of her
dominant resource. Thus, DRF satisfies the sharing
incentive property. If all users have the same dominant
resource, each user gets exactly $1/n$ of that resource. As a
result, DRF satisfies the bottleneck fairness property as
well. □


Theorem 11 Every DRF allocation is envy-free. 


Proof Assume by contradiction that user $i$ envies
another user $j$. For user $i$ to envy another user $j$, user $j$
must have a strictly higher share of every resource that $i$
wants; otherwise $i$ cannot run more tasks under $j$’s
allocation. This means that user $j$’s dominant share is strictly
larger than user $i$’s dominant share. Since every resource
allocated to user $i$ is also allocated to user $j$, this means
that user $j$ cannot reach its saturated resource after user $i$,
i.e., $t_j \le t_i$, where $t_k$ is the time that user $k$’s allocation
gets frozen due to saturation. However, if $t_j \le t_i$, under
progressive filling, the dominant shares of users $j$ and $i$
will be equal at time $t_j$, after which the dominant share
of user $i$ can only increase, violating the hypothesis. □


Theorem 12 (Strategy-proofness) A user cannot in- 
crease her dominant share in DRF by altering her true 
demand vector. 


Proof Assume user $i$ can increase her dominant share by
using a demand vector $\hat{d}_i \ne d_i$. Let $a_{i,j}$ and $\hat{a}_{i,j}$ denote
the amount of resource $j$ user $i$ is allocated using
progressive filling when the user uses the vector $d_i$ and $\hat{d}_i$,
respectively. For user $i$ to be better off using $\hat{d}_i$, we need
that $\hat{a}_{i,k} > a_{i,k}$ for every resource $k$ where $d_{i,k} > 0$.
Let $r$ denote the first resource that becomes saturated for
user $i$ when she uses the demand vector $d_i$. If no other
user is allocated resource $r$ ($a_{j,r} = 0$ for all $j \ne i$),
this contradicts the hypothesis as user $i$ is already
allocated the entire resource $r$, and thus cannot increase her
allocation of $r$ using another demand vector $\hat{d}_i$. Thus,
assume there are other users that have been allocated $r$
($a_{j,r} > 0$ for some $j \ne i$). In this case, progressive
filling will eventually saturate $r$ at time $t$ when using $d_i$, and
at time $t'$ when using demand $\hat{d}_i$. Recall that the
dominant share is the maximum of a user’s shares, thus $i$ must
have a higher dominant share in the allocation $\hat{a}$ than in
$a$. Thus, $t' > t$, as progressive filling increases the
dominant share at a constant rate. This implies that $i$, when
using $\hat{d}_i$, does not saturate any resource before time $t'$,
and hence does not affect other users’ allocations before
time $t'$. Thus, when $i$ uses $\hat{d}_i$, any user $m$ using resource
$r$ has allocation $a_{m,r}$ at time $t$. Therefore, at time $t$, there
is only $a_{i,r}$ amount of $r$ left for user $i$, which contradicts
the assumption that $\hat{a}_{i,r} > a_{i,r}$. □

The strategy-proofness of DRF shows that a user will
not be better off by demanding resources that she does
not need. The following example shows that excess
demand can in fact hurt a user’s allocation, leading to a lower
dominant share. Consider a cluster with two resources,
and 10 users, the first with demand vector (1, 0) and the
rest with demand vectors (0, 1). The first user gets the
entire first resource, while the rest of the users each get
1/9 of the second resource. If user 1 instead changes her
demand vector to (1, 1), she can only be allocated 1/10 of
each resource and the rest of the users get 1/10 of the
second resource.
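
The arithmetic behind this example can be written out explicitly. The lines below are a sketch under the appendix's assumptions (divisible resources, each resource normalized to capacity 1), with $s$ denoting the dominant share reached when progressive filling stops.

```latex
% Truthful demands: user 1 is alone on resource 1; users 2..10 share resource 2.
s_1 = 1, \qquad 9\, s_j = 1 \;\Rightarrow\; s_j = \tfrac{1}{9} \quad (j = 2, \dots, 10).

% User 1 reports (1,1): all ten users now consume resource 2 at the same rate,
% so resource 2 saturates first and every allocation is frozen.
10\, s = 1 \;\Rightarrow\; s = \tfrac{1}{10} \quad \text{for every user},
\qquad \text{so user 1 drops from } 1 \text{ to } \tfrac{1}{10}.
```

The inflated demand lowers user 1's dominant share and also reduces every other user's share from 1/9 to 1/10.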

In practice, the situation can be exacerbated as re- 
sources in datacenters are typically partitioned across 
different physical machines, leading to fragmentation. 
Increasing one’s demand artificially might lead to a situ- 
ation in which, while there are enough resources on the 
whole, there are not enough on any single machine to 
satisfy the new demand. See Section 6.2 for more infor- 
mation. 

Next, for simplicity we assume strictly positive de- 
mand vectors, i.e., the demand of every user for every 
resource is non-zero. 


Theorem 13 Given strictly positive demand vectors,
DRF guarantees that every user gets the same dominant
share, i.e., every DRF allocation ensures $s_i = s_j$, for all
users $i$ and $j$.


Proof Progressive filling will start increasing every
user’s dominant resource allocation at the same rate until
one of the resources becomes saturated. At this point, no
more resources can be allocated to any user as every user
demands a positive amount of the saturated resource. □
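
Under the assumptions of Theorem 13, the common dominant share can be written in closed form. The derivation below is an illustrative sketch: $x_i$ (the possibly fractional number of tasks given to user $i$) and $\alpha$ (the common dominant share) are notation introduced here, while $s_i$ is the per-task dominant share from Section A.1.

```latex
% Equal dominant shares: x_i s_i = \alpha for every user i.
x_i = \frac{\alpha}{s_i}, \qquad s_i = \max_k \frac{d_{i,k}}{r_k}.

% No resource k may be over-allocated:
\sum_i x_i \frac{d_{i,k}}{r_k}
  = \alpha \sum_i \frac{d_{i,k}}{r_k \, s_i} \le 1 \quad \text{for all } k.

% Progressive filling stops at the first resource that saturates:
\alpha = \frac{1}{\max_k \sum_i \dfrac{d_{i,k}}{r_k \, s_i}}.
```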


Theorem 14 Given strictly positive demands, DRF sat- 
isfies population monotonicity. 


Proof Consider any DRF allocation. Non-zero demands
imply that all users have the same saturated resource(s).
Consider removing a user and relinquishing her currently
allocated resources, which is some amount of every
resource. Since all users have the same dominant share $\alpha$,
any new allocation which decreases any user $i$’s
dominant share below $\alpha$ would, due to Pareto efficiency, have
to allocate another user $j$ a dominant share of more than
$\alpha$. The resulting allocation would violate max-min
fairness, as it would be possible to increase $i$’s dominant
share by decreasing the allocation of $j$, who already has
a higher dominant share than $i$. □


However, we note that in the absence of strictly posi- 
tive demand vectors, DRF no longer satisfies the popula- 
tion monotonicity property [14]. 


PIE in the Sky:
Online Passive Interference Estimation for Enterprise WLANs


Vivek Shrivastava 


Abstract 


Trends in enterprise WLAN usage and deployment 
point to the need for tools that can capture interference 
in real time. A tool for interference estimation can not 
only enable WLAN managers to improve network per- 
formance by dynamically adjusting operating parameters 
like the channel of operation and transmit power of ac- 
cess points, but also diagnose and potentially proactively 
fix problems. In this paper, we present the design, imple- 
mentation, and evaluation of a Passive Interference Esti- 
mator (PIE) that can dynamically generate fine-grained 
interference estimates across an entire WLAN. PIE in- 
troduces no measurement traffic, and yet provides an ac- 
curate estimate of WLAN interference tracking changes 
caused by client mobility, dynamic traffic loads, and 
varying channel conditions. Our experiments conducted 
on two different testbeds, using both controlled and real 
traffic patterns, show that PIE is not only able to provide 
high accuracy but also operate beyond the limitations of 
prior tools. To show how it helps with performance diagnosis
and real-time WLAN optimization, we describe its use in multiple
WLAN optimization applications: channel assignment,
transmit power control, and data scheduling.


1 Introduction 


Radio interference remains a key performance bottle- 
neck for enterprise WLANs [25]. In spite of significant 
progress in planning, deploying, and managing enterprise
WLANs, administrators today have very few tools that
can help them understand how much interference exists 
in their network, and how interference patterns evolve 
over time. Building an on-line tool for enterprise-wide 
WLAN interference estimation is particularly challeng- 
ing, because interference is highly dynamic in nature. 
Each time a new client arrives, departs, moves, or 
changes its traffic pattern, the number of other nodes in 
the network it interferes with (and the degree to which it 
interferes) changes. Further, wireless channel conditions 
are never static but continuously evolve with changes in 
the environment, e.g., even with the opening or closing 
of a door, people walking, etc. 


The goal of this paper is to answer the following 
question: Given an enterprise WLAN consisting of 
a number of Access Points (APs) and mobile clients, 
compute its real-time conflict graph, i.e., identify the
precise set of nodes that interfere with each other and 
the degree to which they do so at any specified point of 
time. 


Applications of interference estimation: This prob- 
lem of interference estimation is fundamental to under- 
standing the behavior of any wireless network. Further, 
interference estimates and the conflict graph serve as im- 
portant inputs to many WLAN configuration problems, 
e.g., channel assignment for each AP, transmit power se- 
lection, and even emerging strategies for data scheduling 
across the enterprise WLAN [22]. 

A number of research efforts have made significant 
progress toward this tool building goal. Prior techniques 
for interference estimation mainly employ active prob- 
ing (interference maps [15] and micro-probing [3]) and 
suffer from three main problems: a) they incur moder- 
ate to significant measurement overhead and cannot be 
employed to continuously obtain interference informa- 
tion across time, b) they offer limited visibility into the 
root cause of interference, c) they often require specific 
client modifications. While some recent work has also 
explored the potential for passive interference estimation, 
it is mostly limited to offline trace collection and analysis,
and thus cannot be employed in real time. 

In this paper, we explore an alternate design for a prac- 
tical online interference estimation mechanism, one that 
does not impose any active measurement traffic on the 
WLAN. It is completely passive in nature, and estimates 
interference by simply observing ongoing traffic at the 
different APs. Specifically, we present the design, imple- 
mentation, and detailed evaluation of a Passive Interfer- 
ence Estimator (PIE) system. 

Our work is inspired by two key passive WLAN mon- 
itoring approaches proposed earlier: Jigsaw [8, 9] and 
WIT [13]. These systems provide us with two useful 
building blocks: (i) a platform for capturing wireless traffic
and merging traces collected from different vantage
points and (ii) specific tools to infer interesting properties
about the 802.11 network from such merged traffic
traces. However, both these research efforts stop short of 
addressing our goal of designing a real-time interference 
estimation tool. The key features of PIE are: 

1. It captures dynamic interference information 
quickly and robustly: PIE captures interference infor- 
mation across the entire WLAN within a few hundred 
milliseconds. It can effectively identify the real interferers
when multiple overlapping transmitters are present.

2. It uses real traffic patterns: PIE is passive; it esti- 
mates interference using actual traffic patterns in the net- 
work, capturing the effects of bit rate adaptation, varying 
packet sizes, and traffic burstiness. 

3. It has low overhead and causes no downtime: Be- 
ing passive, PIE does not take away wireless bandwidth 
resources from users. 

4. It does not require client modifications: The PIE 
mechanism is implemented at the APs and a central con- 
troller placed within the enterprise wired network. No 
client modifications are required. 

PIE relies on the accurate timestamping of transmis- 
sions by the AP. These timestamps could be reported ac- 
curately by the firmware of the AP’s wireless card. How- 
ever, most off-the-shelf wireless cards do not expose this 
functionality and hence in our current implementation 
we use a second card at the AP to gather accurate times- 
tamps of wireless transmissions. 


Key contributions 


This paper makes the following key contributions: 

• We identify the key requirements for a practical interference
estimation mechanism. We then carefully design
PIE to meet those requirements and report various design 
choices to infer interference in real time. 

• We evaluate the accuracy and agility of PIE using
both controlled experiments and playback of
real traffic traces. For 95% of the links, PIE achieves
accuracy comparable to the state-of-the-art technique of
bandwidth tests (see §2). We further show that PIE can
efficiently track the changing interference patterns caused
by client mobility, variable transmission rates and varying
traffic loads. Results from our playback of real traces
indicate that PIE can converge to the correct interference
estimate within 540 ms, 700 ms and 900 ms for heavy,
medium and low traffic load periods, respectively. This
represents up to a 300× speedup over bandwidth tests.

• Demonstrate the utility of PIE in interference mitigation
mechanisms: We show the usefulness of PIE by
integrating it with three interference mitigation mechanisms:
1) centralized scheduling, 2) transmit power control
and 3) channel assignment. We show that real-time
conflict information provided by PIE can enhance the 
performance of such mechanisms and outperform band- 
width tests under dynamic settings. 

• Employ PIE to uncover performance issues in two
production WLANs: We use PIE to monitor two produc- 
tion WLANs. We show that PIE can correctly infer sub- 
tle performance issues like asymmetric channel access 
and hidden terminal problems. 

The rest of the paper is organized as follows. §2 discusses
the current state of the art in wireless interference
estimation. The fundamental principles behind PIE are
described in §3. We present the design and operation of
PIE in §4. We evaluate and validate our mechanism in
§5. Finally, we conclude in §6.


Interference maps [15] | Microprobing [3]





No client mods x x 
Online x J J J 
Zero downtime J x J J 
Real traffic x x J J 
No wireless 
control traffic a a ss Vv 


Table 1: Comparing PIE with other interference estimation mech- 
anisms. 


2 Related work 


We classify prior interference estimation and wireless 
monitoring efforts into the following categories. 
Interference estimation tools: Bandwidth test mechanisms
[16, 15] systematically transmit a simultaneous
burst of traffic along each pair of AP-client links and
observe how the aggregate throughput differs from the
throughput achieved by each link operating in isolation.
Recently, Ahmed et al. [3, 4] proposed the use of
micro-experiments, each lasting less than a millisecond, to
detect different kinds of conflict between WLAN nodes.
Such mechanisms require network downtime and must
rely on specific traffic patterns to test the interfering links,
which may deviate from real traffic scenarios.

CMAP [24] is a technique designed to solve the exposed
terminal problem using passive conflict graphs. However,
it requires the interferers to be in the communication 
range of the receiver and will miss conflicts in which the 
interferer is outside the communication range but inside 
the interference range. Further, it requires driver level 
modifications to both APs and clients. Given that CMAP 
relies on modified clients, it is better able to infer uplink 
conflicts as well. However, since the fraction of uplink 
traffic might be limited (as reported for some enterprise 
WLANs [22]), we take the penalty of missing some up- 
link conflicts in order to avoid client modifications. Ta- 
ble 1 presents a comparison of our design of PIE with 
some prior proposed interference estimation tools. 
Wireless monitoring studies: Researchers have recently 
conducted several studies to understand the performance 
of different 802.11 networks using trace collection, fol- 
lowed by empirical analysis. Each system is designed to 
analyze specific aspects of an 802.11 wireless network, 
ranging from physical and link-level behavior [21, 2, 24], 
client coverage [7], to understanding the performance of 
TCP/IP in wireless environments [9]. However, most of 
these mechanisms are geared towards offline analysis of 
wireless traces to derive interesting measures for their tar- 
get 802.11 network. Recently, a short paper [6] proposed 
a machine learning approach to infer high-level interfer- 
ence. However, the proposed technique provides limited 
visibility and does not capture all types of interference. 

Finally, WIT [13] and Jigsaw [8] are interesting measurement
studies that have influenced some of the design
decisions in PIE. In WIT, traces are captured using 5
sniffers in a wireless network and a state machine based 
learning approach is proposed to study the performance 
of the 802.11 MAC protocol in a practical deployment. 
Jigsaw deploys a large wireless monitoring infrastruc- 
ture consisting of 150 sniffers to monitor a production 
WLAN and performs a cross-layer analysis to diagnose 
performance problems. Both these mechanisms present 
excellent insights into the functioning of an 802.11 network,
but unlike PIE, they do not focus on evaluating
the accuracy and agility of their interference estimation
mechanisms, especially under interference settings that
can arise due to client mobility and the use of bit rate
adaptation mechanisms. Also, they do not discuss the
integration of their interference estimation mechanisms with
applications like power control and channel assignment. 


3 Interference estimation in PIE 


Interference in an enterprise WLAN can be broadly clas- 
sified into two categories: (a) sender-side interference 
caused by carrier sensing between two transmitters, and 
(b) receiver-side interference caused by collision at the 
receiver. While carrier sensing determines how the trans- 
mitters share the wireless medium, collision-induced in- 
terference determines whether transmissions are success- 
fully decoded at the intended receiver. The goal of PIE is 
to identify both of these interference properties in a non- 
intrusive manner. We now explain the intuition behind 
PIE with the help of a simple example. 

Intuition behind PIE: Consider a scenario from an enterprise
WLAN (shown in Figure 1) where APs A and B
are far enough apart that they cannot carrier sense
(CS) each other. Assume that two clients $C_A$ and $C_B$ are
associated with APs A and B respectively. Suppose some
downlink packets are being enqueued at APs A and B
for transmission to their respective
clients, $C_A$ and $C_B$. The APs follow the regular 802.11
carrier sensing mechanism, and transmit to their clients
whenever possible.

In PIE, APs A and B periodically send their frame
transmission timestamps to the controller. Further, the
frames are tagged with their reception status indicating
whether this frame transmission was successful or not
(i.e., whether the AP has received an ACK for this frame
or not). The controller parses these timestamps and
identifies the four scenarios shown in Figure 1(b). Looking
at scenarios 1 and 2, the controller observes that frame
transmissions from A and B (denoted by $P_A$ and $P_B$)
overlap in both directions, indicating that A and B do
not defer to each other, and hence are not within carrier
sense range. Additionally, the controller can also
infer that whenever a transmission for client $C_B$ overlaps
with a transmission by AP A, then $C_B$ is not able to
decode the transmission (i.e., $P_B$ is lost). On the other
hand, transmissions for $C_A$ are not lost despite overlapping
transmissions by AP B. Hence the controller concludes
that AP A interferes with link (B, $C_B$) but B does
not interfere with (A, $C_A$). The controller can then use
this information to efficiently mitigate interference for
$C_B$. For example, it can perform downlink data scheduling
[22] and allocate different time slots to (A, $C_A$) and
(B, $C_B$). Alternatively, the controller can also assign
different channels to APs A and B, thereby allowing both
transmissions to proceed simultaneously without any
interference. As this example demonstrates, having accurate
interference estimates could enable the controller
to improve client performance in an enterprise WLAN
by employing interference mitigation mechanisms
effectively. We now give a detailed explanation of how PIE
identifies these interference properties in a non-intrusive
manner.

Figure 2: Detecting the carrier sense relationship between two
links on the basis of timestamps of transmissions by the two
transmitters A and B. Timestamps refer to the MAC timestamp of
wireless frames as reported by the wireless card.


3.1 Estimating carrier sense (CS) interference


PIE identifies the carrier sense relationships based on the 
order in which competing transmitters access the wire- 
less channel. Figure 2 shows the possible order of chan- 
nel access for different carrier sensing relationships. As 
shown, there can be four cases of channel access: 

(a) Overlapping frame transmissions (Cases 1, 2 and
3): Case 1) When two competing transmitters are not in
carrier sensing range, they can access the channel in any
order and hence the controller would observe that their
frames overlap in both directions. Cases 2, 3) In the case of
one-way carrier sensing, the frames will only overlap in
one direction. For example, if $T_1$ carrier senses $T_2$ but
not vice versa, then $T_1$ will defer for $T_2$’s transmissions.
However, $T_2$ will not defer for $T_1$’s transmissions, and
would transmit even if $T_1$’s frame is still in the air. Hence
the controller should only observe overlaps when $T_1$’s
transmission is already in the air and is overlapped by a
later $T_2$ transmission.

(b) Non-overlapping transmissions (Case 4): If both
the transmitters can mutually carrier sense each other,
the controller should not see any overlaps as carrier sensing
will serialize their frame transmissions. However, we
note that non-overlapping transmissions may also be
observed in scenarios where the two transmitters do not
simultaneously contend for the channel, and transmit their
frames one after another due to their specific traffic
patterns. In such a scenario, it is difficult to make any
inference regarding the carrier sense relationship of the two
transmitters. In order to distinguish the cases where
transmitters are actually contending for the medium, we use the
mechanism outlined in [13]. The controller labels a pair
of frames as being transmitted by "contending"
transmitters if their starting timestamps are within a time interval
$\gamma$, where $\gamma$ is the total time that can be spent by
competing transmitters performing back-off. Although all
traffic within the $\gamma$ interval may not contend for the channel,
this heuristic was shown to be effective for practical
settings [13]. We use a value of $\gamma = 28 + 320$ μs (DIFS
+ max back-off period for 802.11g). The pseudo-code
for estimating carrier sense properties in PIE is shown in
Algorithm 1 (Procedure ComputeCS).

Figure 1: Overview of PIE, showing the overall infrastructure, the feedback processing performed at the Controller and the integration
of PIE with channel assignment and scheduling. The detection of conflict between AP B and client C2A i) places the two APs in separate
channels when channel assignment is performed, or ii) serializes the transmissions between AP A and B.
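
The overlap bookkeeping described in this subsection can be sketched as follows. This is a minimal illustration, not PIE's implementation: the frame representation (`(start_us, end_us)` tuples), the function name `count_overlaps`, and the dictionary keys are all assumptions made for the example. It follows the text literally in treating two frames as contending when their start timestamps lie within the interval of 28 + 320 μs, and it only produces the counts; mapping those counts to the four cases via thresholds is left out.

```python
GAMMA_US = 28 + 320  # DIFS + max back-off period for 802.11g, as quoted in the text


def count_overlaps(frames_a, frames_b, gamma_us=GAMMA_US):
    """Count contending frame pairs and the direction of any overlap.

    frames_a, frames_b: lists of (start_us, end_us) tuples for transmitters
    A and B (a hypothetical input format chosen for this sketch).
    """
    counts = {"contending": 0, "b_over_a": 0, "a_over_b": 0, "no_overlap": 0}
    for a_start, a_end in frames_a:
        for b_start, b_end in frames_b:
            # Only frame pairs that start within gamma are treated as contending.
            if abs(a_start - b_start) > gamma_us:
                continue
            counts["contending"] += 1
            if a_start <= b_start < a_end:
                # B started while A's frame was in the air: B did not defer to A.
                # (Exact ties in start time are counted here as well.)
                counts["b_over_a"] += 1
            elif b_start < a_start < b_end:
                # A started while B's frame was in the air: A did not defer to B.
                counts["a_over_b"] += 1
            else:
                counts["no_overlap"] += 1
    return counts
```

Overlaps seen in both directions suggest that neither transmitter senses the other, overlaps in only one direction suggest one-way carrier sensing, and contending pairs with no overlaps suggest mutual carrier sensing. The nested loop is quadratic in the number of frames; a real deployment would presumably merge the two sorted timestamp streams instead.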


3.2 Estimating collision-induced interference


PIE identifies collision-induced interference at the re- 
ceiver by computing the probability of a frame loss at the 
receiver when it overlaps with a simultaneous transmis- 
sion from a competing transmitter. Intuitively, the extent 
of interference is directly proportional to the probability 
of losing overlapping frames. Note that this allows PIE 
to maintain a continuous interference model, where the 
extent of interference can be any value between 0 and 1. 
Such a model is better suited for realistic environments 
where the binary model of interference may not suffice. 
On the basis of this observation, in PIE, we use the Link 

Algorithm 1 PIE: CS and INT computation


Procedure ComputeCS:
Inputs: number of frames in contention $n_c$, number of case (3)
overlaps $n_f$, and number of case (2) overlaps $n_r$, CS threshold $\delta_t$
($\delta_t = 0.8$ in our implementation)
$n_o = n_f + n_r$
lin =Ne— Ne 
if (= > 64) then 
/* case 4 (A and B sense each other) */ 
return Ae 


else if (“2 > 6z) then /* sufficient overlaps to compute prob */ 


if ( a. > 64) then 
/* cases 3 (A senses B) */ 
return =< 
Ne 
else 
/* case 1 (A and B do not sense each other) */ 
return 2 
Te 
else 
/* inconclusive (wait for more samples) */ 
return — 
Procedure ComputeINT:
Inputs: total number of frames $n_f$, number of frames lost $n_l$,
number of overlapping frames $n_o$, number of overlapping frames lost
$n_{ol}$, overlapping packets threshold $\beta_t$ ($\beta_t = 20$ in our
implementation)
if ($n_o > \beta_t$) then
    $l_{iso} = (n_l - n_{ol})/(n_f - n_o)$ /* loss in isolation */
    $l_{int} = n_{ol}/n_o$ /* loss under interference */
    LIR $= (1 - l_{int})/(1 - l_{iso})$
    return LIR
else
    /* inconclusive (wait for more samples) */
    return (-)


Interference Ratio (LIR) described below, as the metric
to quantify interference for a link.


Link Interference Ratio (LIR): For a pair of interfering
links, LIR captures the loss in performance observed when
the two links are interfering, as opposed to operating in
isolation. Consider a link (A, B) and its interferer C. We
measure $D_{AB}$, the delivery probability of the link
(A, B) in isolation (A is active, C is inactive). We then
measure $D^{C}_{AB}$, the delivery probability of the link
when interferer C is also active with A. The LIR is given by:

$LIR = D^{C}_{AB} / D_{AB}$    (1)


LIR takes values between 0 and 1. An LIR of 0 means
that link (A, B) cannot deliver frames in the presence of
C, while an LIR of 1 means that C does not impact link
(A, B). LIR values between 0 and 1 indicate the extent
of interference on link (A, B) by interferer C. When A
and C are in carrier sense range, LIR will be equal to 1,
since the interferer C is able to share the channel with
the transmitter A without causing any decrease in the
delivery ratio of link (A, B).¹ The pseudo-code for
estimating interference is shown in Algorithm 1 (Procedure
ComputeINT). PIE requires a certain threshold of overlap
packets ($\beta_t$) to accurately estimate the loss rate under
interference. We use $\beta_t = 40$ for our implementation as
it is the smallest threshold that yields stable interference
estimates under diverse experimental scenarios.
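
As a rough illustration of Procedure ComputeINT and Equation (1), the following Python sketch computes LIR from passively collected frame counts. The function and argument names are ours, not PIE's, and returning None for the inconclusive case (as well as for the degenerate cases with no isolation samples or total loss in isolation) is an assumption made for this example. The default threshold of 40 follows the value quoted in the text.

```python
from typing import Optional


def compute_lir(
    n_total: int,
    n_lost: int,
    n_overlap: int,
    n_overlap_lost: int,
    threshold: int = 40,
) -> Optional[float]:
    """Estimate the Link Interference Ratio from frame counts.

    Returns None when there are too few overlapping frames to decide
    (the "inconclusive" branch of the pseudocode).
    """
    if n_overlap <= threshold:
        return None  # inconclusive: wait for more samples

    n_isolated = n_total - n_overlap
    if n_isolated <= 0:
        return None  # assumption: no isolation samples means no estimate

    loss_iso = (n_lost - n_overlap_lost) / n_isolated  # loss in isolation
    loss_int = n_overlap_lost / n_overlap              # loss under interference
    if loss_iso >= 1.0:
        return None  # assumption: link delivers nothing even in isolation

    return (1.0 - loss_int) / (1.0 - loss_iso)
```

For example, with 1000 frames of which 130 were lost, and 100 overlapping frames of which 40 were lost, the loss in isolation is 90/900 = 0.1, the loss under interference is 0.4, and the sketch returns 0.6/0.9, roughly 0.67.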
Handling simultaneous overlaps from multiple interferers:
A client packet may overlap with multiple simultaneous
transmissions from potential interferers. In such a
scenario, the packet overlap and its subsequent loss or
success is attributed to each overlapping interferer.
Further transmission diversity will allow PIE to observe
events that will distinguish the true interferer from the
nodes that happened to transmit at the same time (future
overlapping transmissions by false interferers will not
lead to loss). As we show later in our evaluation in
§5.1.3, there is significant diversity in wireless
transmissions in realistic settings to allow PIE to operate
efficiently in practice.
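
The attribution rule described above can be sketched as follows. This is a minimal illustration under our own assumptions: the Frame record, its field names, and the quadratic scan over all frames are hypothetical simplifications, not PIE's data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    sender: str
    start_us: int        # transmission start, in microseconds
    end_us: int          # transmission end, in microseconds
    lost: bool = False   # only meaningful for frames on the observed link


def overlaps(a: Frame, b: Frame) -> bool:
    """True if the two transmissions share any airtime."""
    return a.start_us < b.end_us and b.start_us < a.end_us


def attribute_overlaps(
    link_frames: List[Frame], other_frames: List[Frame]
) -> Dict[str, List[int]]:
    """Map each interferer to [overlapping frames, overlapping frames lost].

    Every sender that overlaps a link frame is credited with that frame's
    outcome, so one lost frame may count against several interferers.
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for frame in link_frames:
        senders = {
            other.sender
            for other in other_frames
            if other.sender != frame.sender and overlaps(frame, other)
        }
        for sender in senders:
            counts[sender][0] += 1
            if frame.lost:
                counts[sender][1] += 1
    return dict(counts)
```

Over time, a false interferer accumulates overlaps that do not coincide with loss, which drives its estimated loss under interference down, while the true interferer's stays high.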


4 PIE Design and Operation 


In this section, we describe the design and operation of 
PIE. A schematic overview of the overall design can be 
seen in Figure 1. PIE has the following three compo- 
nents. 

Sniffing at the APs: In our current implementation of
PIE, sniffing of the wireless medium is limited to the
APs in the enterprise WLAN. This allows us to avoid 
the additional overhead associated with the deployment 
and management of extra sniffers in the enterprise build- 
ing. However, sniffing solely at the APs might result in 
reduced coverage of uplink client traffic, as compared to 
a dense sniffer deployment (e.g., as in Jigsaw [8]). In or- 
der to overcome this limitation, we employ the finite state 
mechanisms outlined in [13] (based on 802.11 states) to 
infer some of the missing client transmissions. We note 


¹ Note that this measure of LIR differs slightly from the
interference metric proposed in [16], which relies on effective
throughput and not delivery probability. However, throughput based
LIR is ambiguous for carrier sensing scenarios, where a LIR value of
0.5 could mean 50% loss or carrier sensing. Hence we use delivery
probability as it provides greater clarity into the LIR values in
all scenarios.


that even with such mechanisms, it is difficult to cap- 
ture all uplink client transmissions using monitors at the 
AP, and hence PIE may not be able to detect all uplink 
conflicts accurately. However, we accept this penalty of 
missing some uplink client conflicts in order to avoid de- 
ploying additional monitors. 

PIE requires accurate timestamp information for accurate
interference estimation. However, due to limitations of the
existing Atheros driver and firmware, it is difficult to
extract the exact time at which a packet is transmitted
over the medium.² In order to overcome this problem, in
our implementation of PIE, APs are equipped with two
radios: one radio is used for normal packet transmissions
and receptions, while the other radio is used for capturing
packets on the wireless medium. The Atheros driver
timestamps every frame that is received over the interface
using an on-board 64-bit microsecond resolution timer.
Thus a second radio that captures packets can record the
exact timestamp of the packet transmission. Moreover,
the proximity of the two radios ensures that the second
radio receives the majority of frames transmitted by the
AP due to capture effect.
Synchronization of clocks at the APs: PIE needs the
APs to synchronize their clocks so that the controller can
compare their packet transmission reports and determine
the extent of overlap between any two transmissions
reported by the APs. Further, time synchronization should
be tight to allow accurate 802.11 analysis, on the order
of 20-30 µs [8]. Prior mechanisms for 802.11 analysis
[8, 9, 13, 26] synchronized the APs by finding common
beacon packets in their transmission reports. However,
performing such offline synchronization at the controller
can be time consuming, and impractical for a real time
interference estimation mechanism. To synchronize the
clocks across the APs, we use the time synchronization
protocol implemented by the Atheros driver [1]. As part
of the protocol, the AP embeds a 64-bit microsecond
granularity time stamp in every beacon frame, and the
nodes that listen to the AP adjust their local clock
based on this broadcast timestamp [12]. In order to make
this synchronization seamless, we set up a virtual ad hoc
interface on the second radio of each AP. All the APs
that join the ad hoc network then synchronize themselves
in real time using the beacons of the reference AP for the
network. This approach has two key benefits: 1) it is an
online mechanism, meaning the nodes synchronize their
clocks every time beacons are received from neighboring
nodes, and 2) it is transitive in nature, and works as long
as the network is not partitioned.


² This is because once the driver passes the packet to the firmware,
a variable delay is introduced based on the length of the firmware
transmit queue and the amount of time the radio performs carrier
sensing/back-off. Further, retry and other 802.11 packets (like beacons)
are handled solely by the firmware, making timestamp estimation more
challenging.

| Section | Objective | Topology | Observation |
|---|---|---|---|
| §5.1.1 | Accuracy of PIE for | 2-link (Hidden / Exposed / | PIE is accurate within ±0.1 of ground truth for 95% of scenarios |
| §5.1.2 | Accuracy of PIE under client mobility, variable rates and packet sizes | 2-link (Hidden) topology | PIE is able to track the changing interference patterns in real-time (~100 ms) |
| §5.1.3 | Evaluate accuracy with multiple simultaneous transmitters | 15-node topology | PIE is accurate when transmitters overlap less than 75% of time |
| §5.2.2, 5.3 | Convergence time of PIE under real traffic trace replay | 15-node topology | Median convergence time is 400, 600, 720 ms for heavy, medium and light client traffic |
| §5.4.1, 5.4.2, 5.4.3 | Performance of channel assignment, power control & scheduling with PIE | 15-node topology | Outperforms bandwidth tests in dynamic cases (1.25x, 1.50x gain in goodput, fairness) |
| §5.4.4 | Performance diagnostics in two production WLANs | 386 & 464 AP-client links | 8-11% links suffer from hidden terminals and 20% links show rate anomaly problems |

Table 2: Summary of evaluation results.


Collecting and processing feedback from the APs: In
PIE, the Controller periodically polls the APs for their
transmission reports. The granularity of polling is a
tunable parameter, which can be determined empirically.
Lower polling periods will enable PIE to update
interference estimates faster. On the other hand, increasing
the polling period allows APs to sample more packets per
transmission report, increasing the accuracy of
interference estimates. We evaluate this tradeoff in §5 and
show that a polling period of at least ~100 ms is needed to
achieve good accuracy for PIE. Feedback processing at
the Controller takes $O(m^2 n)$ time, where m is the number
of APs and n is the number of packets per AP.³
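
Because each AP's transmission report is already sorted by timestamp, the Controller can combine the reports into one global timeline with a k-way merge. The sketch below uses Python's standard heapq.merge for that step only; the report layout (a mapping from AP identifier to a sorted list of timestamp and frame-id pairs) is an assumption for illustration, and the paper does not state which merge procedure PIE uses.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

# (start timestamp in microseconds, AP identifier, frame identifier)
Record = Tuple[int, str, int]


def merge_reports(reports: Dict[str, List[Tuple[int, int]]]) -> Iterator[Record]:
    """Merge per-AP reports, each sorted by timestamp, into one sorted stream."""
    streams = (
        [(timestamp, ap, frame_id) for timestamp, frame_id in frames]
        for ap, frames in reports.items()
    )
    # heapq.merge consumes already-sorted inputs lazily and never re-sorts them.
    return heapq.merge(*streams)
```

The merged stream can then be scanned once to find transmissions from different APs that are adjacent in time and therefore candidates for overlap.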
Handling multi-rate links: The exact impact of an in- 
terferer on a transmitter-receiver pair also depends on the 
physical layer bit rate being used by the transmitter. PIE 
tags the LIR value for each link-interferer pair with the 
bit rate being used for packet transmission on the wire- 
less link. During the computation of LIR values as de- 
scribed in §3.2, overlap and isolation losses are recorded 
separately for each physical layer data rate and then the 
corresponding LIR value is computed for each rate. The 
Controller maintains a two-level lookup table for LIR val- 
ues, where the first level is indexed by the link-interferer 
pair and the second level provides values for different 
rates used by the link for the given interferer. This data 
structure can also be extended for tagging conflicts with 
the transmit power level of the interferer, allowing the 
Controller to estimate the level of conflict under differ- 
ent power levels. 
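
A minimal sketch of the two-level lookup table described above is given below. The class name, method names, and the choice of a (transmitter, receiver, interferer) tuple as the first-level key are hypothetical; the text specifies only that the first level is indexed by the link-interferer pair and the second by the bit rate.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# First-level key: (transmitter, receiver, interferer)
LinkInterferer = Tuple[str, str, str]


class LirTable:
    """Two-level table: link-interferer pair -> bit rate (Mbps) -> LIR."""

    def __init__(self) -> None:
        self._table: Dict[LinkInterferer, Dict[float, float]] = defaultdict(dict)

    def update(self, pair: LinkInterferer, rate_mbps: float, lir: float) -> None:
        """Record the latest LIR estimate for this pair at this rate."""
        self._table[pair][rate_mbps] = lir

    def lookup(self, pair: LinkInterferer, rate_mbps: float) -> Optional[float]:
        """Return the stored LIR, or None if no estimate exists yet."""
        return self._table.get(pair, {}).get(rate_mbps)
```

Extending the second-level key to a (rate, interferer transmit power) tuple would cover the power-level tagging mentioned in the text without changing the structure.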

Interaction with external interference: External
interference can be caused by non-enterprise wireless traffic
and/or non-WiFi traffic (like microwaves). In the first
case, if the non-enterprise traffic source is visible to any
enterprise AP, its transmission timestamps would be
reported to the PIE controller, which could then use the
normal procedure to detect if the external source is
causing any problems for the enterprise clients. In the
second case, when the external interferer is not visible to
any enterprise AP (like a non-WiFi source or a hidden
external WiFi source), PIE would not be able to identify
the source of interference.

³ Since the transmission report by each AP is already sorted, the
overhead of merging at the Controller is small.


5 Evaluation of PIE 


We divide the evaluation section into three distinct sub- 
sections. First, we demonstrate that PIE accurately cap- 
tures interference in real time. We do so by comparing 
PIE with bandwidth tests. Next, we measure the time 
taken by PIE to converge to accurate interference esti- 
mates, under both controlled traffic loads and realistic 
trace replay on the wireless testbed. Lastly, we integrate
PIE with a number of real time WLAN optimization
mechanisms to offer evidence that PIE is useful for
real-time problem diagnosis on a WLAN.


We evaluate PIE on two different testbeds. Our central
controller, implemented in about 3,000 lines of C code and
a few hundred lines of Perl script, runs on a standard
Linux PC (3.33 GHz dual core Pentium IV, 2 GB DRAM). The
wireless APs are Soekris (Testbed 1) and VIA-based
(Testbed 2) nodes, modified slightly to improve path
latencies. Each
node in the two testbeds is equipped with two Atheros 
AR5212 chipset wireless NICs. We use saturated UDP 
traffic for our experiments unless otherwise specified. 


Summary: A summary of the results presented in this
section is shown in Table 2. Our results show that (i) PIE
accurately estimates LIR under different carrier sensing
and interference relationships, (ii) PIE can handle client
mobility, variable bit rates and packet sizes, (iii) PIE
is able to distinguish between multiple interferers when
overlap in transmissions is less than 75%, (iv) PIE
converges within 100 ms for saturated traffic, and within
400 ms, 600 ms and 720 ms when heavy, medium and light
activity traffic periods are replayed from a real trace, (v)
PIE enables WLAN applications to perform efficiently
in dynamic scenarios, (vi) PIE can identify performance
problems caused by hidden terminals and rate anomaly in
production WLANs.

Figure 3: Scatter plots comparing the LIR values of PIE with the
ground truth computed using unicast bandwidth test for all possible
combinations of carrier sensing and interference relationships
that can occur in a two link canonical topology. Packet size and
data rate were fixed at 1400 bytes and 6 Mbps respectively. Note that
for all scenarios, the value computed by PIE is close to the value
reported by bandwidth test, as indicated by the proximity of these
values to the x=y line in the plots.

Figure 4: Distribution of error in predicting (a) Carrier Sense
probability, and (b) LIR value as compared to the ground truth
computed using unicast bandwidth tests, for the sixteen canonical
scenarios outlined in Figure 3.


5.1 Accuracy of PIE 


We evaluate PIE’s accuracy using two different methods. 
First, we construct all possible conflict scenarios using a 
canonical two link topology. This experiment serves as 
our controlled experiment that allows us to assess accu- 
racy and focus on the underlying phenomena causing any 
discrepancies between PIE and bandwidth tests. Second, 
we generalize our findings across a large-scale testbed, 
quantifying PIE’s overall accuracy. Overall accuracy is 
further evaluated across a number of dimensions that take 
into account diverse transmission rates, packet sizes,
interference scenarios, and density.
Metrics for comparison: Both experiments are evaluated
according to the Link Interference Ratio (LIR) described
in §3.2. LIR is the ratio of the frame delivery
probability⁴ of a link (A, B) under interference from C
and in isolation ($D^{C}_{AB} / D_{AB}$).
Compared schemes: We compare three approaches that 
measure LIR with differing levels of overhead. 

1) Unicast bandwidth tests (Ground truth): The
conventional approach is to use unicast bandwidth tests
(UBT) to determine the impact of an interferer on a
link [16]. In unicast bandwidth tests, A transmits unicast
packets to B in isolation and under interference from C.
We then report LIR as the ratio of frame delivery
probabilities under the two scenarios. This is an accurate
test to determine LIR as it uses unicast traffic, which
takes into account the impact of C on the receiver (data
packet collisions) and the sender (ACK collisions).
Henceforth, we use the LIR value reported by unicast
bandwidth tests as the "ground truth" in our experiments.
Note that UBT incurs significant overhead: it takes
$O(n^4)$ measurements to compute a conflict graph for an
n-node topology, and hence is not practical to use under
dynamic wireless environments.

⁴ The 802.11 ACK is included in the frame delivery rate for unicast frames.

2) Broadcast bandwidth tests: In broadcast bandwidth
tests (BBT), broadcast traffic from A to B is used
to compute the frame delivery ratios, both in isolation
and under interference from C. This method was proposed
as a relatively fast way to measure interference
relationships among a large number of links [16].
Broadcast tests can compute the conflict graph for a
topology of n nodes using $O(n^2)$ measurements (as opposed
to $O(n^4)$ for UBT). However, broadcast tests do not take
data-ACK collisions into account and hence may be
inaccurate in some scenarios.

3) PIE: PIE computes the LIR value in a passive fashion
by determining the conditional loss probability of
packets on link (A, B) that are interfered by interferer C.
A packet $P_i$ on link (A, B) is considered to be interfered
if it overlaps with a transmission from interferer C that
leads to packet loss. The LIR in this case is computed by
passively observing the events in the wireless medium as
recorded at the controller. Pseudocode for PIE is shown
in Algorithm 1 (function ComputeINT).

In what follows, all experiments are performed using
802.11a (except the live WLAN measurements in §5.4.4),
to prevent interference from the co-located department
WLAN that uses 802.11g. Furthermore, the PIE
measurements are collected passively through the
observation of the probe traffic generated by the bandwidth
tests.


5.1.1 Static interference settings 


We start by comparing the LIR generated by the three 
mechanisms for different canonical scenarios, as shown 
in Figure 3. In order to have a fair comparison, we 
first evaluate the accuracy of PIE under static data rate 
(6 Mbps) and packet size (1400 bytes) settings, as the
overhead for computing LIR in dynamic settings (client
mobility, variable rates) can be significant for bandwidth
tests. We then relax these constraints and evaluate the
performance of PIE under dynamic interference scenarios
triggered by client mobility, the use of variable
transmission rates and different packet sizes.

Figure 5: Distribution of error for PIE as compared to LIR
values computed using UBT. We note that in 95% of the interference
scenarios PIE is within 0.1 of the actual LIR value.

Controlled experiments: Using a canonical two link
topology we benchmark different carrier sensing and
interference scenarios. We selectively disable the carrier
sensing of transmitters to create the complete set of
scenarios. Assuming that $C_1$ is associated with AP A and
$C_2$ is associated with AP B, the possible interference
relationships between the two links are as follows:
(i) A interferes with $C_2$ and B interferes with $C_1$
($A \rightarrow C_2 \wedge B \rightarrow C_1$),
(ii) A interferes with $C_2$, B does not with $C_1$
($A \rightarrow C_2 \wedge B \nrightarrow C_1$),
(iii) B interferes with $C_1$, A does not with $C_2$
($A \nrightarrow C_2 \wedge B \rightarrow C_1$), and
(iv) A and B do not interfere with each other's client
($A \nrightarrow C_2 \wedge B \nrightarrow C_1$).
Further, the possible carrier sensing relationships between
the two transmitters are: (i) A and B carrier sense each
other ($A \leftrightarrow B$), (ii) B carrier senses A
($A \rightarrow B$), (iii) A carrier senses B
($A \leftarrow B$), and (iv) A and B are not in carrier
sensing range ($A \nleftrightarrow B$).
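
The sixteen canonical scenarios are simply the cross product of the four interference relationships and the four carrier sensing relationships listed above. The short sketch below enumerates them; the plain-text labels are our own stand-ins for the paper's arrow notation.

```python
from itertools import product
from typing import List, Tuple

# Interference relationships between the two links (C1 is A's client, C2 is B's).
INTERFERENCE = [
    "A interferes with C2, B interferes with C1",
    "A interferes with C2, B does not interfere with C1",
    "A does not interfere with C2, B interferes with C1",
    "neither AP interferes with the other's client",
]

# Carrier sensing relationships between the two transmitters.
CARRIER_SENSE = [
    "A and B sense each other",
    "B senses A",
    "A senses B",
    "A and B do not sense each other",
]


def canonical_scenarios() -> List[Tuple[str, str]]:
    """Return every (carrier sensing, interference) combination."""
    return list(product(CARRIER_SENSE, INTERFERENCE))
```

Each of the resulting pairs corresponds to one panel entry in the scatter plots of Figure 3.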

Figure 3 compares the LIR values computed by PIE 
and unicast bandwidth test for the sixteen possible sce- 
narios of carrier sensing and interference between two 
links. It also identifies cases which correspond to mu- 
tual (two-way) and asymmetric (one-way) hidden termi- 
nals. As shown in the figure, the LIR estimates of PIE 
are very close to the values reported by the unicast band- 
width tests. Also, Figure 4 shows the distribution of error 
in estimating carrier sense probability and LIR values for 
these different scenarios. As is clear from the figure, PIE
is able to estimate the carrier sensing and LIR values with
good accuracy (±0.15) for all scenarios. Note that
identifying both carrier sensing and LIR values accurately
can characterize client performance under any scenario. For
instance, in the scenario where the interference
relationship is $A \rightarrow C_2 \wedge B \rightarrow C_1$,
the links can achieve similar throughputs when they are
carrier sensing and sharing the channel
($A \leftrightarrow B$) or when they are not carrier sensing
(two-way hidden terminal) and there is close to 40%
loss rate for the links. PIE can provide this greater
visibility, as to which phenomenon is actually taking place,
which can then be used by interference mitigation
mechanisms.

Accuracy in larger testbed: We repeat the experiments
reported in Figure 3 for a large number of link pairs in our
testbed, comprising 30 nodes spread across five floors of
our department building. We select links whose delivery
ratio in isolation is greater than 0.9 in both directions.⁵
Figure 5 compares the values of LIR achieved using unicast
bandwidth test and PIE for 43 interference scenarios.
We note that for 95% of the interference scenarios, PIE
is within 0.1 of the actual LIR value. We experimented
with different convergence thresholds and found that
convergence within 0.1 of the actual LIR value was
sufficient for practical applications (see §5.4 for
performance of such applications).

Figure 6: Scatter plot of delivery ratios obtained using bandwidth
tests (unicast - LIR(Actual), broadcast - LIR(BBT)) and PIE on 43
link pairs. Note that LIR(BBT) may underestimate the loss rates
as it does not take the ACK loss into account.

Finally, we note some inaccuracies that are intro- 
duced through approaches like BBT, which aim to col- 
lect interference information at low overhead. BBT will 
mis-estimate when interference impacts the reception of 
ACKs rather than data packets. Figure 6 does indeed con- 
firm that such cases do exist in reality and that they lead 
to the underestimation of loss. 


5.1.2 Dynamic interference settings 


The previous experiments quantified PIE’s accuracy as 
compared to the ground truth generated using unicast 
bandwidth tests. However, PIE is not only able to ac- 
curately capture interference under static conditions, but 
more importantly, also under dynamic conditions. 

Handling client mobility: Any practical interference
estimation mechanism must be able to handle client
mobility, i.e., it should be able to update the conflict graph
in real time to reflect the changing interference patterns
that arise due to client movement. In order to evaluate
PIE’s ability to handle mobile clients, we perform a micro
experiment, where a mobile client is moving away
from its AP towards a hidden interferer as shown in
Figure 7. In this experiment, the client is moving at a pace
of 0.25 m/s.⁶ The bottom plot in the Figure shows the
signal strength at the client from the AP and the interferer,
while the middle and top plots show the throughput of the
mobile client and the LIR estimate by PIE at each instant
in the experiment. As shown in the Figure, PIE’s LIR
estimate decreases as the client moves towards the
interferer. Furthermore, it closely matches the trend shown

⁵ We wanted to consider stable links (high SNR) for analysis. In
reality, poor SNR links would rarely be selected during client
association to APs.

Figure 7: PIE’s ability to track the changing interference patterns
for a mobile client. In this experiment, a mobile client is moving
away from its AP towards a hidden interferer. The bottom plot
shows the signal strength at the client from the AP and the
interferer. The middle plot shows the throughput achieved by the client
at each instant. The top plot shows the LIR as measured by PIE.


by the instantaneous throughput during the experiment, 
which confirms PIE’s accuracy in predicting the end user 
performance in dynamic wireless environments. 
Variable rate and packet sizes: Prior research [24, 5]
has shown that the interference properties of wireless 
links are impacted by the data transmission rate and 
packet size. In order to evaluate PIE for different packet 
sizes and data rates, we repeat our canonical experiments 
with different packet sizes and data rates on multiple 
interferer-link pairs. To evaluate multiple data rates, we 
first activate a link in isolation and then activate an inter- 
ferer, which forces the transmitter to adjust its data rate 
to minimize losses. We use the default Atheros rate adap- 
tation algorithm, SampleRate. Figure 8 (left) shows the 
impact of data rate on the delivery ratio of a link (LIR by 
UBT) and the estimate of LIR generated by PIE for each 
rate in the experiment. 

Next, we fix the data rate and vary the packet size for 
a link under interference (right plot). As expected, LIR 
is worse for larger packet sizes, which are prone to more 
errors. We observe that the combination of data rate and 
packet size can result in varying interference properties 
and PIE is able to efficiently identify the impact of in- 
terference accurately in each such scenario (confirmed 
by the agreement with UBT). This also shows that us- 
ing bandwidth tests or other active measurements may 
require performing an exponential number of tests with 
varying packet sizes and rates to determine the interfer- 
ence impact for any given traffic scenario. PIE, on the 
other hand, can passively determine the extent of inter- 
ference for each scenario efficiently and accurately. 


5.1.3 Classifying interferers accurately 


PIE’s fundamental operation relies on observing overlap
in transmissions and correlating such events with packet
loss. One could argue that PIE’s accuracy is likely to be
affected by scale since the probability of observing
overlap in transmissions across the network increases with
greater scale. Identifying the transmitter responsible for
loss then becomes much harder. To answer this question we
attempt to quantify the success of PIE in correctly
identifying an interferer depending on the amount of time
that it tends to overlap with the transmitter suffering the
loss.

Figure 8: Impact of physical layer data rate and packet size on
the delivery ratio of a link in a canonical hidden terminal
topology. While varying data rate, packet size is fixed at 1400 bytes,
and while varying packet size, data rate is fixed at 24 Mbps. Note the
significant drop in delivery ratio with rate while the impact of
packet size is less pronounced. 90% confidence intervals were found
to be tight and hence are omitted for clarity.

Figure 9: Ability of PIE to identify true interferers from a set of
active transmitters. (a) LIR measured by PIE for both the true
interferer and the non-interfering transmitter as a function of the
overlap in transmission times. If the overlap fraction is less than
75%, PIE can distinguish the false and true interferers accurately.
(b) Overlap in transmission times for all wireless transmitter pairs
that are active during a one hour time window (2pm - 3pm) in the
UCSD wireless trace. As is clear from the trace, about 90% of the
transmitter pairs overlap less than 20% of the time, providing
sufficient traffic diversity for PIE.


Canonical experiments: Consider a link (A, B) and
two interferers $C_1$ and $C_2$. We compute the actual LIR
of the link under $C_1$ and $C_2$ by performing individual
unicast bandwidth tests, first with $C_1$ and then with $C_2$.
According to the unicast tests, the LIR of the link under
interference from $C_1$ and $C_2$ is 0.6 and 0.99 respectively,
indicating substantial interference from $C_1$ and no
interference from $C_2$. We term $C_1$ the interfering
transmitter and $C_2$ the non-interfering transmitter. Our
goal is to evaluate the accuracy of PIE in identifying the
interfering ($C_1$) and non-interfering ($C_2$) transmitters,
when both $C_1$ and $C_2$ are activated simultaneously. Both
$C_1$ and $C_2$ follow an HTTP traffic model, with sleep and
active times being drawn from an 802.11 wireless trace [13].
We then identify the time periods (1s) in the experiment with
varying overlaps between the transmission times of $C_1$
and $C_2$ and measure the LIR values for $C_1$ and $C_2$
according to PIE.

Figure 9(a) shows the LIR obtained by PIE for both
the interfering ($C_1$) and non-interfering ($C_2$) transmitter
as a function of the overlap in their wireless transmission
times. As expected, when the overlap in transmission
times is close to 100%, PIE is unable to distinguish
between true and false interferers. When the overlap is
less than 60%, PIE can distinguish between the false and
true interferer. In fact, notice that even for high overlaps
(close to 75%), the median loss probability for the false
interferer is close to 0. Further, as shown in Figure 9(b),
more than 90% of the transmitters in a real WLAN trace
(UCSD WLAN [8]) overlap less than 20% of the time,
indicating rich diversity in transmission patterns for
wireless users. Such diversity will enable PIE to function
efficiently in realistic deployments.
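
One plausible way to compute the overlap in transmission times between two transmitters is to intersect their busy intervals, as sketched below. The paper does not spell out its exact definition, so normalizing by the first transmitter's total airtime is an assumption, as are the function name and the interval representation (sorted, non-overlapping start and end times in microseconds).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_us, end_us), with start_us < end_us


def overlap_fraction(a: List[Interval], b: List[Interval]) -> float:
    """Fraction of a's airtime during which b is also transmitting.

    Both lists must be sorted by start time, and the intervals within
    each list must not overlap one another.
    """
    total = sum(end - start for start, end in a)
    if total == 0:
        return 0.0

    shared = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return shared / total
```

For instance, if the first transmitter is busy during 0-100 and 200-300 µs and the second during 50-150 and 250-400 µs, the shared airtime is 100 µs out of 200 µs, an overlap fraction of 0.5.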


Multiple interferer experiments: To validate the previous
result with multiple interferers, we repeat the
aforementioned experiments in a larger topology. In our
experiments, we try to emulate the structure of our
in-building WLAN by placing one testbed AP node near
each production AP in the environment. We present results
from a representative topology that randomly distributes
client nodes into offices. The topology has 7
APs and 8 clients. Clients connect to the AP with the
strongest signal strength. Each transmitter follows an HTTP
on-off model for transmitting data with the on and off
times derived from the UCSD trace. We classify all
interferers for which the UBT LIR is less than 0.8 (> 20%
loss) as strong (interfering) transmitters and the rest are
classified as weak (non-interfering) transmitters.


Figure 10 (a) shows the number of strong and weak 
interferers per client as determined by UBT in our topol- 
ogy. Figure 10 (b) shows the ability of PIE to identify 
multiple strong and weak interferers in this topology. As 
shown in the Figure, the LIR values estimated by PIE are
within ±0.15 of the actual LIR determined by pairwise
bandwidth tests using unicast traffic (UBT). In summary,
PIE is able to accurately identify the exact impact of
each interferer on every client in the system even in the
presence of multiple simultaneous transmitters. We show
the overall impact of such an accurate conflict graph on
application level performance for wireless clients in the
system in §5.4.


5.2 Agility of PIE 


PIE can be integrated in today’s centralized WLANs, re- 
quiring software-only modifications to the central con- 
troller. However, as is apparent from the design section, 
there are a number of knobs in PIE’s design that are
likely to affect its accuracy. In this section, we study
appropriate values for the polling interval, and measure
PIE’s convergence time under varying loads.

Figure 10: Accuracy of PIE for an 8 client, 7 AP topology. (a)
Distribution of strong (LIR < 0.8) and weak (LIR > 0.8) interferers. (b)
CDF of the error in PIE’s estimation of LIR for a link-interferer
pair as compared to pairwise bandwidth test (UBT). PIE identifies
both multiple strong and weak interferers accurately (all estimates
are within ±0.15 of UBT LIR values). PIE is able to identify the
extent of interference accurately in the presence of multiple strong
and weak interferers.

Figure 11: (a) Impact of polling period on the accuracy of the
interference measures produced by PIE. LIR value stabilizes for polling
periods greater than 100 ms. The experiment time was adjusted to
ensure same sample size for different polling periods. (b)
Convergence time for a canonical hidden terminal link as a function of
traffic load on the link and the interferer.


5.2.1 Polling interval 


Any online interference estimation mechanism must 
identify conflicts in real time to be useful. In PIE,
the controller periodically polls the APs for transmission 
summaries and then determines link conflicts. Higher 
polling periods can provide more information to the con- 
troller, thereby improving the quality of interference es- 
timation. However, having a higher polling period also 
makes the system less responsive, which may be critical 
to dynamic interference scenarios. Here we evaluate the 
performance of PIE with different polling periods and de- 
termine the minimum period for which PIE can provide 
stable LIR values. We define a LIR value reported by 
PIE to be stable when the 90th and 10th percentiles of 
the LIR estimates differ by less than 0.1 of the mean LIR 
value. Figure 11 (a) demonstrates that a value of 100 
ms provides a good compromise between reactivity and 
accuracy. 
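
The stability criterion above can be checked with a few lines of Python. This is a sketch under our reading of the definition, namely that the gap between the 90th and 10th percentiles must be smaller than 0.1 times the mean; the function name, the percentile interpolation method, and the treatment of fewer than two samples as unstable are assumptions.

```python
import statistics
from typing import Sequence


def is_stable(lir_estimates: Sequence[float], tolerance: float = 0.1) -> bool:
    """True if the 10th-90th percentile spread is below tolerance * mean."""
    if len(lir_estimates) < 2:
        return False  # assumption: too few samples to judge stability

    # n=10 yields the nine decile cut points; the first and last are
    # the 10th and 90th percentiles.
    deciles = statistics.quantiles(lir_estimates, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    mean = statistics.fmean(lir_estimates)
    return (p90 - p10) < tolerance * mean
```

Running this check over the LIR estimates collected at each candidate polling period gives the smallest period for which the estimates settle.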

Note that smaller polling periods will also increase
the communication overheads for sending traffic reports
from the AP to the Controller. Using an average packet
size of 600 bytes, and a medium constantly busy at 54
Mbps, the AP in PIE will have to store a summary for
1125 packets in each 100 ms polling period. This results
in 9 KBytes sent from each AP every 100 ms, i.e., under
1 Mbps, easily sustained by the AP.
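
The overhead figures above can be reproduced with simple arithmetic. In the sketch below, the link rate, packet size, polling period, and report size come from the text; treating 9 KBytes as 9000 bytes and the derived per-packet summary size are our own assumptions.

```python
from typing import Tuple


def report_overhead(
    link_rate_bps: int = 54_000_000,
    avg_packet_bytes: int = 600,
    polling_period_ms: int = 100,
    report_bytes: int = 9_000,  # assumption: 9 KBytes taken as 9000 bytes
) -> Tuple[int, float, float]:
    """Return (packets per period, bytes per packet summary, report rate in bps)."""
    bits_per_period = link_rate_bps * polling_period_ms // 1000
    packets = bits_per_period // (avg_packet_bytes * 8)
    bytes_per_summary = report_bytes / packets
    report_rate_bps = report_bytes * 8 * 1000 / polling_period_ms
    return packets, bytes_per_summary, report_rate_bps
```

With the default values, a saturated 54 Mbps medium carries 1125 packets of 600 bytes per 100 ms, which implies about 8 bytes of summary per packet and a report stream of 720 kbps per AP.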
Figure 12: Convergence time and accuracy for PIE on a 7 AP - 8 Client topology under realistic patterns replayed from a period of (a) light
client activity and (b) heavy client activity (using TCP). The top part of both figures shows the convergence time for each link-interferer pair
and the bottom part shows its corresponding accuracy when traffic traces are replayed on our representative topology. As shown in the
figure, for light (heavy) traffic scenarios, PIE takes 1150 ms (650 ms) or less for 95% of link-interferer pairs to converge within ±0.1 of their
actual value.


5.2.2 Convergence time 


Convergence time is defined as the amount of time taken
by PIE to gather sufficient samples to compute an accurate
LIR estimate (within ±0.1 of ground truth). Accordingly,
the time taken by PIE to converge on an accurate
estimate for link interference depends on two key factors:
i) the polling period used by PIE to collect statistics from
the APs, and ii) the actual amount of traffic that is captured
by the APs in a given polling period. We first examine
the impact of traffic load on the convergence of PIE
by systematically varying the load on the canonical two
link topology. Figure 11(b) shows the convergence time
for a canonical hidden terminal link as a function of traffic
load on the link and the interferer. Both the link and
the interferer use a physical data rate of Mbps, while the 
traffic load is varied from 6 Mbps (saturated) to 0.2 Mbps
(light). Reduction in traffic load leads to longer convergence
times because of the reduced frequency of interference
events. Note, however, that LIR values would
correspond to perceived client performance degradation
only under relatively heavy loads, in which case PIE
could capture events in 100 ms. In contrast, the measurement
overhead of prior active interference estimation
mechanisms based on bandwidth tests (e.g., Interference
maps [15]) is in the range of 20-30 seconds per link
pair [16].
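As an illustration of the convergence-time definition, the sketch below scans a series of timestamped LIR estimates and reports the earliest time from which the estimate stays within the tolerance of ground truth. The requirement that it *stays* within tolerance, and the function name, are assumptions made for this example; the text only defines the ±0.1 tolerance.

```python
def convergence_time_ms(samples, ground_truth, tolerance=0.1):
    """samples: list of (time_ms, lir_estimate) pairs in time order.

    Returns the earliest time from which every later estimate lies within
    `tolerance` of `ground_truth`, or None if the series never settles."""
    converged_at = None
    for time_ms, estimate in samples:
        if abs(estimate - ground_truth) <= tolerance:
            if converged_at is None:
                converged_at = time_ms
        else:
            # The estimate left the tolerance band, so start over.
            converged_at = None
    return converged_at
```

For example, with estimates taken every 100 ms, a series that briefly touches the band and then leaves it again is only counted as converged from its final entry into the band.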


Next, in order to understand the convergence of PIE 
under realistic traffic patterns, we replay a real WLAN 
trace [18] on the representative (7AP - 8 Client) topology 
(described in Section 5.1). 


5.3 Experiments with real wireless traces


We now present experimental results on the performance 
of PIE using the publicly available Sigcomm 2004 traf- 
fic traces [18]. The Sigcomm trace was partitioned into 
heavy, medium, and light periods corresponding to peri- 
ods with airtime utilization of more than 50%, between 
20-50%, and less than 20% respectively, at different 
times of the conference [19]. In these traces, HTTP trans- 
actions were categorized into a series of HTTP sessions. 
Each session consists of a set of timestamped operations 
starting with a connect, followed by a series of sends and 
receives (called transactions), and finally a close. The 
HTTP sessions are then replayed on our testbed using the 
mechanism described in [10]. In our experiments, each 
client emulated the behavior of one real client from the 
trace, faithfully imitating its HTTP transactions. We use 
TCP as the underlying transport protocol for trace replay. 

Figure 12 shows the convergence time (top plot) and
accuracy (bottom plot) of PIE for each link-interferer
pair when access patterns from the light and heavy load
periods are replayed on the representative topology. As
shown in the figure, for light (heavy) traffic scenarios, PIE
converges to ±0.1 of the actual LIR value within 1150
ms (800 ms) for more than 95% of the link-interferer
pairs.⁷ Further, Figure 13 shows the distribution of convergence
time of PIE for different link-interferer pairs under
all three load periods. As expected, the convergence time
is smaller for higher activity periods. The median convergence
times for the heavy, medium, and light traffic loads
are 400 ms, 620 ms, and 700 ms respectively.


⁷ We skip detailed results from medium activity periods and instead
show only the distribution for the medium activity period to save space.

Figure 13: Distribution of convergence time for all link-interferer
pairs under realistic traffic scenarios. Traffic scenarios (TCP
based) are classified as heavy, medium and light depending on the
total traffic load. As expected, PIE’s convergence is faster for heavy
traffic scenarios (median of 400 ms), followed by medium (median
of 620 ms) and light (median of 700 ms) traffic.


5.4 Applying PIE to improve WLAN performance


Being able to track interference in a highly dynamic 
environment may be considered as an admirable aca- 
demic exercise. In this section, we will prove that ac- 
cess to such information can better enable a number 
of real time mechanisms that have been proposed for 
the performance optimization of wireless networks. To 
that end, we have integrated PIE with three such mech- 
anisms (channel selection, dynamic packet scheduling, 
and power control) and tested them on two different 
testbeds. Our results clearly demonstrate that all these 
functions become a viable tool in the hands of network 
operators as long as we can supply reliable interference 
information in real time. 


We use the same 7 AP and 8 client topology that we 
described in §5. We set the polling period to 1 second as
per our observation in §5.2.2, thus capturing interference
accurately even under low traffic loads. In mobility ex- 
periments, each client moves along a corridor at ~0.25 
m/s. We use UDP traffic for our experiments to mea- 
sure the performance of PIE with different applications. 
We also perform experiments with TCP traffic for cen- 
tralized scheduling application and report the results for 
the same. 





Conflict graph   Mechanism (Num Channels)   System Tput (Mbps)   Jain’s Fairness Index
NA               Single channel             [illegible]          [illegible]
NA               LCCS (3)                   [illegible]          0.58
UBT              Conflict aware (3)         24.6                 0.72
PIE              Conflict aware (3)         24.9                 0.71

Table 3: Performance of conflict-aware channel assignment (using 
conflict graph generated by PIE and bandwidth tests) as compared 
with single channel and LCCS (least congested channel search) as- 
signments. Under static conditions, PIE leads to similar results as 
UBT, offering significant improvement compared to single channel 
and LCCS assignments. Note that UBT being an active technique 
has significantly higher measurement overhead and is not practical.

5.4.1 Application I: Channel assignment

Efficiently assigning channels to access points (APs) in 
an enterprise WLAN can significantly affect the network 
performance and capacity [14, 20]. We implement a
conflict-aware channel assignment heuristic (Randomized
Compaction), proposed in [20], that takes a conflict
graph as input and performs channel assignment with
the objective of minimizing interference. We compare the
performance of the conflict-aware channel assignment
scheme when based on the conflict graph generated by
PIE with its performance when based on the graph from
unicast bandwidth tests (UBT).

Table 3 shows the total system throughput and Jain’s 
fairness index achieved by each channel assignment 
mechanism. Bandwidth tests are performed with uni- 
cast traffic at data rate and packet size of 6Mbps and 
1400 bytes. Experiments are performed under static set- 
tings for a fair comparison with bandwidth tests. We 
consider the conflict graph generated by bandwidth tests 
as the true interference information. Results are aver- 
aged over 20 runs. We note that conflict aware chan- 
nel assignment significantly improves system throughput 
over LCCS [11] (least congested channel search) and sin- 
gle channel assignments. Moreover, the performance of 
the heuristic is similar with PIE and bandwidth tests, il- 
lustrating PIE’s ability to generate high quality conflict 
graphs in real time. 
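For reference, Jain's fairness index over n per-client throughputs x_1, ..., x_n is (Σ x_i)² / (n · Σ x_i²). It equals 1 when all clients receive the same throughput and 1/n when a single client receives everything. A minimal sketch, with an illustrative function name:

```python
def jain_fairness_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(throughputs)
    sum_of_squares = sum(x * x for x in throughputs)
    if n == 0 or sum_of_squares == 0:
        raise ValueError("need at least one non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_of_squares)
```

The index is independent of scale, so it can be read alongside total system throughput: the two columns of Table 3 measure how much capacity an assignment delivers and how evenly that capacity is shared.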


[Figure 14 panel titles: Iter(0) [9.2, 0.53]; Iter(10) [11.2, 0.71]; Iter(20) [12.7, 0.79]]

Figure 14: Performance of an iterative power control mechanism
that uses PIE. Each matrix represents the conflict graph, with overall
capacity (total system throughput in Mbps) and Jain’s fairness
index listed in the title. Intensity of darkness is proportional to
the extent of interference. The final state corresponds to reduced
interference, improved overall network capacity and fairness.


5.4.2 Application II: Transmit Power Control 
We implement a simple centralized power control heuris- 
tic that uses the dynamic conflict information produced 
by PIE to reduce interference in the system. We measure 
the performance of the system through LIR_all, i.e., the
sum of LIR values, for all link-interferer pairs in the sys- 
tem. Our goal is then to maximize this value by iterating 
over different power levels of the transmitters. 

In each iteration of power control, we identify the most 
dominant interferer, as the AP that sources links with the 
minimum cumulative LIR. We reduce its transmit power
(by 10mW) and recompute the conflict graph using PIE.
If the new conflict graph has lower cumulative LIR, then 
we discard the new power settings and reduce the power 
level of the next strongest interferer. In this way, we 
always move to a new set of power levels only if it in- 
creases the overall performance of the system. We quit 
when there is no improvement in the overall LIR value 
for 10 iterations. 
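The heuristic can be sketched as a greedy loop. This is an illustrative reading, not the authors' code: `measure_lir` is a hypothetical callable standing in for a PIE measurement round, LIR values are keyed by (interferer AP, victim link), the "dominant interferer" is taken to be the AP whose pairs have the lowest cumulative LIR, and a trial setting is kept only if it strictly increases the overall sum. The `floor_mw` lower bound on transmit power is also an assumption.

```python
from typing import Callable, Dict, Tuple

# (interferer AP, victim link id) -> LIR in [0, 1]; 1.0 means no degradation.
LirMap = Dict[Tuple[str, str], float]


def power_control(
    power_mw: Dict[str, float],
    measure_lir: Callable[[Dict[str, float]], LirMap],  # hypothetical PIE hook
    step_mw: float = 10.0,
    floor_mw: float = 10.0,
    patience: int = 10,
) -> Dict[str, float]:
    """Greedily lower AP transmit power while the summed LIR keeps improving."""
    power = dict(power_mw)
    lir = measure_lir(power)
    best_total = sum(lir.values())
    stale_iterations = 0
    while stale_iterations < patience:
        # Cumulative LIR of the pairs in which each AP acts as interferer.
        per_ap = {ap: 0.0 for ap in power}
        for (interferer, _link), value in lir.items():
            per_ap[interferer] += value
        improved = False
        # Try the most dominant interferer first, then the next strongest.
        for ap in sorted(per_ap, key=per_ap.get):
            if power[ap] - step_mw < floor_mw:
                continue
            trial = dict(power)
            trial[ap] -= step_mw
            trial_lir = measure_lir(trial)
            trial_total = sum(trial_lir.values())
            if trial_total > best_total:
                # Keep the new power levels only if overall LIR improved.
                power, lir, best_total = trial, trial_lir, trial_total
                improved = True
                break
        stale_iterations = 0 if improved else stale_iterations + 1
    return power
```

In a live system each call to `measure_lir` would take at least one polling period, so the loop's running time is dominated by measurement rather than computation.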

Figure 14 shows the impact of such a power control 
mechanism. We present three matrices that capture the 
interference caused by each AP (row) to each client (col- 
umn) in the network (the darker the cell, the stronger 
the interference). The title of each matrix further cap- 
tures the iteration, the overall network capacity, and the 
fairness index. The leftmost matrix corresponds to the
default power level setting, while the middle and right
matrices indicate the intermediate and final stages of
the power level settings achieved by the aforementioned
power control heuristic. We clearly see that our simple
power control mechanism reduces the overall conflict
in the system (matrix cells get increasingly lighter),
while increasing overall network capacity and fairness.
The point of this evaluation is not the power control
mechanism itself, since there are a number of solutions
that could achieve such an objective more effectively
(like [17]). Our focus is to demonstrate the effectiveness
of PIE when used for power control.


5.4.3 Application III: Centralized scheduling


Accurate, fast and scalable conflict graph construction is 
critical for realizing centralized data plane mechanisms. 
In a recent work on centralized data path scheduling 
(Centaur [22]), the authors relied on micro-probing [3], an
online mechanism that performs micro experiments to 
determine link conflicts. Although micro-probing can 
generate an accurate conflict graph in very short time 
scales (4 seconds for a 10 link topology), it may still be 
inefficient in high mobility scenarios, especially given 
the need for silencing the network during the measure- 
ment of the conflict graph. We re-evaluate the perfor- 
mance of Centaur using the conflict graph generated by 
PIE and contrast it to bandwidth tests for consistency. We 
show that PIE improves the performance of Centaur un- 
der high mobility and varying traffic properties (variable 
packet sizes and data rates). 

Table 4 shows Centaur’s performance when operating
on conflict information from PIE and bandwidth
tests respectively, in one static and one mobile scenario.
The UBT conflict graph is generated using 6 Mbps and
a fixed packet size of 1400 bytes for static client locations.
Due to the overhead of recomputing bandwidth
tests, we use the static conflict graph for the mobility
scenario too. One can clearly see that exploiting real time
conflict information in scheduling increases not only


Scenario       Mechanism       System Tput (Mbps)   Jain’s Fairness Index
Static (UDP)   DCF             [illegible]          0.64
               Centaur (UBT)   [illegible]          0.88
               Centaur (PIE)   [illegible]          0.84
Static (TCP)   DCF             9.5                  0.60
               Centaur (UBT)   12.2                 0.85
               Centaur (PIE)   12.4                 0.89
Mobile (UDP)   DCF             [illegible]          [illegible]
               Centaur (UBT)   [illegible]          0.71
               Centaur (PIE)   [illegible]          0.95





Table 4: Performance of centralized scheduling (Centaur) using
PIE’s conflict graph. UBT and PIE lead to equivalent performance
under static settings. The introduction of mobility confirms PIE’s
superiority in providing real time information. Note that UBT has
very high measurement overheads compared to PIE.


the overall network throughput but also the fairness in- 
dex across clients. More interestingly, the inaccuracies in 
the conflict graph generated using bandwidth tests almost 
negate the benefits of centralized scheduling under mobility.
We performed similar experiments with auto-rate and
observed that Centaur with PIE’s conflict graph provides
a 32% overall system throughput gain compared to using
the conflict graph generated using bandwidth tests
under static scenarios (6 Mbps, 1400 bytes).

TCP performance: We also analyze TCP performance 
for different conflict graphs. We observe system through- 
puts (fairness) of 9.5 Mbps (0.60), 12.2 Mbps (0.85) 
and 12.4 Mbps (0.89) for DCF, Centaur(UBT) and Cen- 
taur(PIE) respectively. As expected UBT and PIE per- 
form close to each other and outperform DCF. However, 
as noted earlier, the measurement overhead of UBT is 
much higher than that of PIE, making it impractical for real time
mechanisms like Centaur. 


5.4.4 Application IV: Wireless troubleshooting 


Beyond PIE’s ability to enable real time performance op- 
timization in enterprise WLANs, its real time nature al- 
lows it to serve as a diagnosis tool that could be used 
proactively by a network operator to avoid performance 
problems. We test this property by running PIE in two 
production 802.11b/g WLANs (WLAN1 and WLAN2), co-located
with our two testbeds. 

These WLANs differ from each other in many significant
ways. WLAN1 spans 5 floors of a building
and uses 9 APs manufactured by vendor A. The network
administrator was responsible for conducting RF
site surveys, identifying locations to place the APs, and
manually assigning the channel of operation of each AP
to minimize interference. Exactly 3 APs were placed on
each of channels 1, 6, and 11 in WLAN1 to minimize the level
of inter-AP interference. In contrast, WLAN2 occupies
a single floor of a different building, uses 21 APs
manufactured by a different vendor, B, and features a
controller in charge of dynamic channel assignment. The
number of APs on each channel, thus, varies over time.
In WLAN2, the vendor was responsible for conducting
the RF site surveys and making AP placement decisions.
WLAN     HT-Links (LIR < 0.7)   Anomaly-Link pairs (Ratio < 0.2)
WLAN1    31 / 386               231 / 1087
WLAN2    53 / 464               305 / 1391


Table 5: Performance issues observed in two production WLANs. 
The extent of hidden terminal interference ranges from 8% to 11% 
but can be significant for a small number of links. Rate anomaly 
affects approximately 20% of the links in both networks. 


We select testbed nodes closest to the production APs
to provide transmission reports to the PIE controller,
sniffing the transmissions on the operational network.
We use those reports to measure the carrier sense and
interference relationships between different links in the
production WLAN. PIE reveals two performance issues:
1) Hidden terminals: Performance degradation beyond
a certain level due to interference can significantly impact
client performance. We set LIR_thresh equal to 0.7
to identify those links that suffer more than 30% reduction
in their LIR under interference and classify them as
hidden terminals.
2) Rate anomaly: Rate anomaly is a well documented 
problem [23] in wireless environments. If a transmitter 
of a link operating at a high data rate (say 54 Mbps), car- 
rier senses the transmitter of another link operating at a 
low rate (say 6Mbps), then the link operating at higher 
rate will experience significant slowdown in throughput 
(by a factor of 1/10 in this case). We classify a given link 
pair as a case of rate anomaly, when the ratio of their 
transmission rates is less than 0.2. 
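The two classification rules, and the percentages that follow from the counts in Table 5, can be expressed compactly. This is a sketch under the thresholds stated above; the function and constant names are illustrative, and the rate ratio is taken as slower rate divided by faster rate.

```python
LIR_THRESHOLD = 0.7         # links below this LIR count as hidden terminals
RATE_RATIO_THRESHOLD = 0.2  # slower/faster rate below this counts as rate anomaly


def is_hidden_terminal(lir):
    """A link whose LIR under interference drops below the threshold."""
    return lir < LIR_THRESHOLD


def is_rate_anomaly(rate_a_mbps, rate_b_mbps):
    """A carrier-sensing link pair whose transmission rates differ widely."""
    slower, faster = sorted((rate_a_mbps, rate_b_mbps))
    return slower / faster < RATE_RATIO_THRESHOLD


def percentage(count, total):
    return 100.0 * count / total


# Counts taken from Table 5: (affected, total) per WLAN.
hidden = {"WLAN1": (31, 386), "WLAN2": (53, 464)}
anomaly = {"WLAN1": (231, 1087), "WLAN2": (305, 1391)}
for wlan in hidden:
    print(wlan,
          round(percentage(*hidden[wlan]), 1),
          round(percentage(*anomaly[wlan]), 1))
```

The 54 Mbps and 6 Mbps example from the text gives a ratio of about 0.11 and is therefore flagged as a rate anomaly.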

Both these issues are observed in both production networks.
Table 5 shows the extent of hidden interference
and rate anomaly in the two WLANs. The extent of hidden
interference is rather limited (8% for WLAN1 and
11% for WLAN2). For comparison, Jigsaw [8] also reports
that 5% of their links observe an LIR of less than
0.8. Although limited on average, hidden interference
can still lead to up to 70% LIR degradation for as many
as 4% and 3% of the links in WLAN1 and WLAN2 respectively.

In terms of rate anomaly issues, we observe that for
about 20% of carrier sensing link pairs, the transmission
rates differ by more than 80%. This could be one of the
reasons for sudden performance slowdown experienced
by perfectly good quality links in WLANs.


6 Conclusions 


We presented a detailed evaluation of a passive, real time 
interference estimation mechanism (PIE). We showed
that PIE is accurate in estimating link interference and 
can also adapt to changing interference patterns in real 
time. This enables PIE to be especially effective in real- 
istic wireless environments, where client mobility, vari- 
able transmission rates, and bursty traffic result in chang- 
ing interference scenarios, thereby limiting the useful- 
ness of static bandwidth test mechanisms. Further, we 
showed that PIE is completely passive, does not require
client support, and does not cause any network downtime,
making it attractive for use in real WLAN settings. We 
have integrated PIE with interference mitigation mech- 
anisms like centralized scheduling, transmit power con- 
trol, and channel assignment and showed that PIE can en- 
able these mechanisms to function efficiently and dynam- 
ically by providing an accurate conflict graph in real time. 
We also used PIE to monitor two production WLANs and 
demonstrated that PIE can diagnose certain performance 
issues in real systems.

Abstract


While the under-utilization of licensed spectrum based 
on measurement studies conducted in a few developed 
countries has spurred lots of interest in opportunistic 
spectrum access, there exists no infrastructure today for 
measuring real-time spectrum occupancy across vast ge- 
ographical regions. In this paper, we present the design 
and implementation of SpecNet, a first-of-its-kind plat- 
form that allows spectrum analyzers around the world to 
be networked and efficiently used in a coordinated man- 
ner for spectrum measurement as well as implementa- 
tion and evaluation of distributed sensing applications. 
We demonstrate the value of SpecNet through three ap- 
plications: 1) remote spectrum measurement, 2) pri- 
mary transmitter coverage estimation and 3) Spectrum- 
Cop, which quickly identifies and localizes transmitters 
in a frequency range and geographic region of interest. 


1 Introduction 

Radio Frequency (RF) spectrum measurement studies [9, 
10, 5, 7] have confirmed that vast spans of licensed spec- 
trum, deemed white-spaces, are heavily under-utilized. 
Such studies have helped make a case for allowing unli- 
censed devices to utilize unused parts of the spectrum op- 
portunistically. Opportunistic Spectrum Access (OSA) is 
now increasingly seen as a necessity to meet the grow- 
ing demands of wireless applications. In fact, the his- 
toric FCC ruling in 2008 permitting such opportunistic 
use (and in 2010 allowing use without the need to sense 
primaries) is a testament to the success of these measure- 
ment studies. 

Nevertheless, most spectrum measurement studies to 
date have been conducted in a few developed nations, 
using only a handful of spectrum analyzers. Even today, 
the US remains the only country to have allowed an OSA 
model. Many more measurement studies, especially in 
developing nations, are perhaps necessary to make the 
OSA model accepted worldwide. 

Further, these measurements represent static spectrum 
occupancy information over small parts of a country. 
While spectrum allocation is mostly static today, the 
adoption of OSA will result in much more dynamic use 
of spectrum. Thus, access to real-time spatio-temporal 
maps is beneficial for OSA devices to sense other OSA 
devices and determine which parts of the spectrum are 
free/lightly loaded. However, there exists no infrastruc- 
ture today for measuring real-time spectrum occupancy 
across vast geographical regions.

Over the past few years, several researchers have
proposed novel schemes for efficient media access and 
network design in white-spaces [3, 20]. Other re- 
searchers have proposed novel collaborative spectrum 
sensing techniques [11] to allow robust detection of spec- 
trum occupancy. However, thorough evaluation of these 
techniques using real data is hard today. Further, cross- 
geographic questions such as “How do spatio-temporal 
access usage patterns in India differ from those in the 
US?” or “How would a certain OSA technique that works 
well in the US perform in the UK?” cannot be answered 
today. 

The primary contribution of this paper is SpecNet— 
a platform that allows researchers across the world not 
only to conduct spectrum measurement studies remotely 
in real time, but also implement and test novel distributed 
collaborative spectrum sensing applications for OSA. 
SpecNet advances OSA in several ways. First, it helps 
gather spectrum data in many countries, thereby helping 
the adoption of the OSA model worldwide. Second, by 
providing real-time spectrum occupancy maps, OSA de- 
vices may be able to quickly identify lightly loaded parts 
of the spectrum. Third, it provides real trace data that 
can be used to evaluate novel research ideas in OSA. Fi- 
nally, in countries such as India, where there is no readily 
available database of primary users, it can help create an 
accurate database that can be used by OSA devices. 

In SpecNet (Section 4), participant owners of spec- 
trum analyzers register and connect their instruments to 
the SpecNet server. Each owner volunteers to provide 
time periods when SpecNet users are allowed to use the 
instrument to remotely conduct experiments. SpecNet 
provides its users with a rich API implemented as XML- 
RPC calls. Thus, SpecNet users can develop and re- 
motely execute measurements or distributed sensing ap- 
plications in a programming/scripting language of their 
choice. To the best of our knowledge SpecNet is the first 
programmable distributed spectrum sensing platform of 
its kind. SpecNet can be accessed at [15]. 


SpecNet provides an API that supports three classes
of users (Section 4.2). For sophisticated users, SpecNet
provides full access to the low-level APIs of the spectrum
analyzer. For policy users and others mainly interested
in measurement data, say for longitudinal analysis,
SpecNet provides APIs that allow access to historic
measurement data that SpecNet collects and stores in a
database. For other users such as network operators or
government personnel, SpecNet provides a set of high-level
APIs that allow these users to write novel applications
without having to worry about the intricacies of
the spectrum analyzer. For example, a government user
interested in spectrum occupancy data need only spec- 
ify the part of the spectrum (e.g., 500-800 MHz), the 
geographical boundary (e.g., specified by a center and 
radius of a circular region), the time interval (e.g., be- 
tween 12:00 - 16:00 hrs today) and the minimum signal 
strength of the transmitter that needs to be detected (say 
-95 dBm). Behind the scenes, SpecNet determines the 
group of relevant spectrum analyzers and their respec- 
tive settings that will help satisfy the measurement re- 
quest, executes the task on these spectrum analyzers and 
delivers the results to the user. Other users such as OSA 
network operators may be interested in determining the 
coverage of their networks at locations where spectrum 
analyzers may not be available. SpecNet provides an 
interpolation tool that uses measurements from nearby 
spectrum analyzers to estimate power at the location(s) 
of interest. 
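To make the shape of such a high-level request concrete, the sketch below models the four inputs named above and the first step of selecting the analyzers that fall inside the circular region. Everything here is hypothetical: the class names, field names and analyzer coordinates are illustrative and are not SpecNet's actual API, and great-circle distance is assumed for the region test.

```python
import math
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Analyzer:  # hypothetical record for a registered spectrum analyzer
    name: str
    lat: float
    lon: float


@dataclass(frozen=True)
class OccupancyRequest:  # hypothetical request mirroring the example in the text
    f_min_mhz: float
    f_max_mhz: float
    center_lat: float
    center_lon: float
    radius_km: float
    start: time
    end: time
    min_signal_dbm: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def analyzers_in_region(request, analyzers):
    """Keep only the analyzers located inside the requested circle."""
    return [
        a for a in analyzers
        if haversine_km(request.center_lat, request.center_lon, a.lat, a.lon)
        <= request.radius_km
    ]
```

A request for 500-800 MHz between 12:00 and 16:00 with a -95 dBm detection floor would then be handed, together with the selected analyzers, to whatever scheduling step chooses the scan settings.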

Given that spectrum analyzers are expensive ($10- 
40K) and their time of availability for SpecNet’s use 
might be restricted depending on the owner’s needs, an 
important design goal for SpecNet is efficient manage- 
ment of spectrum analyzer time. When two or more 
spectrum analyzers lie in the region of interest, it may be 
possible to coordinate their measurements in a manner 
so as to reduce the overall scanning time while satisfying 
the user’s request. One approach could be to partition 
the frequency spectrum equally among all the spectrum 
analyzers in the region of interest. Another approach is 
to leverage the spatial diversity in the locations of the 
spectrum analyzers and partition the scanning efforts ge- 
ographically. Finally, a hybrid approach that combines 
these two approaches is also feasible. 
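The first approach, equal frequency partitioning, is simple enough to sketch directly. The function name is illustrative, and the sketch ignores the differences in analyzer capability that the actual scheduler has to account for.

```python
def partition_equally(f_min_mhz, f_max_mhz, num_analyzers):
    """Split [f_min, f_max] into equal contiguous sub-bands, one per analyzer."""
    if num_analyzers < 1 or f_max_mhz <= f_min_mhz:
        raise ValueError("need at least one analyzer and a non-empty range")
    width = (f_max_mhz - f_min_mhz) / num_analyzers
    edges = [f_min_mhz + i * width for i in range(num_analyzers)]
    edges.append(f_max_mhz)  # pin the last edge to avoid rounding drift
    return list(zip(edges[:-1], edges[1:]))
```

With three analyzers and the 500-800 MHz range used earlier, each analyzer scans a 100 MHz sub-band, so the wall-clock scan time drops to roughly a third if the analyzers are equally fast.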

Two fundamental tradeoffs underlying the very 
physics of spectrum measurements make this problem of 
partitioning the measurement task among spectrum ana- 
lyzers a significant challenge. First, the time-frequency 
uncertainty principle dictates that the finer the resolution 
of the spectrum scan, the longer it takes to perform the 
scan. Second, weaker signals require longer scan times 
to be amenable to detection. Further, the heterogeneity 
in capability as well as processing speeds across differ- 
ent models of spectrum analyzers adds to the complexity. 
SpecNet considers these tradeoffs and uses a novel task 
partitioning scheme for scheduling individual spectrum 
analyzers (Section 5). 

We demonstrate the power of SpecNet through three 
applications (Section 7). The first application is simply 
a spectrum scan that is performed across different coun- 
tries, illustrating the ability to conduct remote measure- 
ments. The second application is a coverage estimation 
application that may be useful to network operators. The
application first helps localize a TV transmission tower
and then predict its footprint so that operators may avoid 
the primary owner of the spectrum. This is especially 
useful in developing countries where a database of pri- 
mary transmitters is unavailable or incorrect. The third 
application is SpectrumCop, which may be of interest 
to government users. Today, it is hard to detect viola- 
tors of spectrum policy unless a primary owner of the 
spectrum complains of interference. The SpectrumCop 
application allows a user to quickly detect and localize a 
transmitter in a given frequency range and geographic re- 
gion, demonstrating the utility of SpecNet’s coordinated 
sensing platform. 
Thus, we make the following contributions: 


• We present the design and implementation of a novel
platform called SpecNet that allows spectrum analyzers
around the world to be networked and used in a
coordinated manner for remote measurement as well
as testing and implementation of distributed sensing
applications. SpecNet is open for access at [15].

• We present a scheduling algorithm for coordinating
measurements among neighboring spectrum analyzers
that optimizes spectrum analyzer usage time.

• Finally, we present three applications that demonstrate
the value of the SpecNet platform.


2 Related Work 


Measurement Studies. One of the earliest studies that 
aimed at quantifying spectrum usage [9] is by the Shared 
Spectrum Company. The study, conducted at six differ- 
ent locations in the US, concluded that the average occu- 
pancy of spectrum was about 5.2% in the 30 MHz to 3 
GHz frequency range. A study by McHenry et al. [10] in 
Chicago and New York revealed that the occupancy was 
limited to 17% and 13% respectively. Since then, there
have been a number of measurement studies [5, 7, 19] in
different parts of the world. The common finding of all 
these studies has been that spectrum is heavily under- 
utilized. In [4], authors derive various statistics from 
the collected data, and propose a prediction algorithm for 
channel availability. 

All of these studies have been performed using a hand- 
ful (maximum of 4 according to [4]) of spectrum an- 
alyzers scanning spectrum in a small geographical re- 
gion in an uncoordinated fashion. In contrast, SpecNet 
provides a platform for coordinating spectrum analyzers 
across different geographical regions, thus opening doors 
to more interesting measurement studies. Further, it also 
enables building occupancy maps of large geographical 
areas over long durations for longitudinal analysis. 
Whitespace Research. Whitespace networking has 
been gaining attention as an important research field in 
the networking community. In [3], the authors propose
a Wi-Fi like system built on UHF whitespaces. Yang
et al. propose a distributed spectrum access technique 
using frequency agile radios transmitting in orthogonal 
frequencies [20]. Most of these proposals have been 
evaluated in restricted settings. We believe that SpecNet 
would aid whitespace research by allowing evaluation of 
proposals based on broader, more real-world data. For 
instance, spectrum measurement data from different con- 
tinents could be used to evaluate detection techniques. 
Cooperative Sensing & Sensor Networks. Cooperative 
sensing is a well explored topic [11, 16, 6]. The main 
focus of these papers is detecting a primary whose fre- 
quency of transmission and/or location is known. More- 
over, the emphasis is on novel collaborative detection 
techniques. SpecNet and research in collaborative sens- 
ing are complementary to each other. For example, mea- 
surements from SpecNet can be useful for evaluating 
these collaborative detection algorithms while advanced 
collaborative detection techniques can be incorporated 
into the SpecNet platform as an API. 

SpecNet uses Voronoi partitioning for optimizing scan 
time of spectrum analyzers. The use of Voronoi diagrams 
has been proposed in sensor networks as well [17, 2]. 
However, the main motivation for applying a partition- 
ing scheme in sensor networks has been energy savings 
and/or interference avoidance. Thus, the problem formu- 
lations and objective functions are very different. 
Testbeds/Platforms. A number of distributed research 
testbeds/platforms have been built by the community [12, 
1, 18]. To the best of our knowledge, SpecNet is the 
first platform targeted at co-ordinating spectrum analyz- 
ers across geographical regions. 


3 Spectrum Sensing Using Spectrum Analyzers - A Primer


In this section we attempt to answer the question, “what
are the key settings and choices available to a spectrum
analyzer user for spectrum scanning and how do they
influence the spectrum sensing process?”


3.1 Spectrum Scanning - An Example 


We begin with an example spectrum scan of an active
wireless microphone depicted in Figure 1. When scanning
using a spectrum analyzer, a user typically needs
to specify two key parameters: the scanning frequency
range and the resolution bandwidth. The frequency
range, (f_min, f_max) in MHz, specifies that the user is
interested in scanning the spectrum from f_min MHz to
f_max MHz. In Figure 1, the scanning frequency range
is (702.05 MHz, 702.35 MHz). Resolution bandwidth
specifies the granularity in Hz at which the scan is to
be performed; the lower the resolution bandwidth, the
greater the observed detail in the scan.


Figure 1 depicts the results of the scan at four differ- 
ent resolution bandwidths. When the resolution band- 
width is 1 MHz, the microphone’s transmission is not 
at all perceivable. Upon reducing the resolution band- 
width to 30 KHz, a single clear peak emerges indicat- 
ing the microphone’s transmission. Further reducing 
the resolution bandwidth to 10 KHz reveals even finer 
detail—three distinct peaks, which are the signature of 
an FM-modulated transmission. At 1 KHz resolution 
bandwidth, the three peaks are revealed as distinct sharp 
tones. 
Figure 1: Effect of resolution bandwidth


As seen in Figure 1, a lower resolution bandwidth has 
two significant effects on the scan. First, greater detail 
about the signal structure is revealed and second, the 
noise floor is reduced (from -52 to -102 dBm). 


3.2 Occupancy Detection 

Often, the goal behind scanning the spectrum is occu- 
pancy detection, i.e., to determine which parts of the 
spectrum have ongoing transmissions. Fundamentally, 
the problem of occupancy detection attempts to distin- 
guish between signal and noise. While there are sev- 
eral varieties of occupancy detection schemes, perhaps 
the simplest scheme is to check whether the Signal to 
Noise Ratio (SNR) is greater than a certain threshold. 
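
As a rough illustration of the threshold scheme just described, the following Python sketch marks each frequency bin as occupied when its SNR exceeds a threshold. The function name, the example power values and the 10 dB default threshold are assumptions chosen for illustration; this is not SpecNet's detector.

```python
def detect_occupancy(power_dbm, noise_floor_dbm, snr_threshold_db=10.0):
    """Return a 0-1 list with 1 where the SNR (in dB) exceeds the threshold."""
    # In dB units, the SNR of a bin is its power minus the noise floor.
    return [1 if (p - noise_floor_dbm) > snr_threshold_db else 0
            for p in power_dbm]


# Hypothetical scan of five frequency bins with a -100 dBm noise floor.
scan_dbm = [-99.0, -97.5, -72.0, -85.0, -101.0]
occupied = detect_occupancy(scan_dbm, noise_floor_dbm=-100.0)
```

Only the third and fourth bins lie more than 10 dB above the assumed noise floor, so only they are flagged as occupied.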
Figure 2: Noise floor versus resolution bandwidth

Dependence of noise floor on resolution bandwidth: 
As we saw in Figure 1, the noise floor depends on the 
resolution bandwidth of the scan. This decrease in noise 
floor arises from the fact that as frequency bins become 
finer, they accumulate less noise. A lower noise floor 
typically results in a greater SNR and consequently more 
reliable occupancy detection. 
The noise floor (in watts) as a function of resolution
bandwidth $\rho$ is typically given by


$$N \propto \rho \qquad (1)$$


In Eqn 1, the proportionality constant depends on the 
spectrum analyzer model, the antenna, the cabling, 
etc. Figure 2 depicts the dependence of noise floor on 
resolution bandwidth for three different models of spec- 
trum analyzer. While practical measurements indicate 
minor deviations in linearity, as seen in Figure 2, the lin- 
ear model (Eqn 1) holds approximately true for all spec- 
trum analyzers we used. 
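
Under the linear model of Eqn 1, a change in resolution bandwidth translates into a fixed shift of the noise floor on a dB scale. The sketch below shows the conversion; the function name and the reference values (-80 dBm at 100 kHz) are hypothetical and are not measurements from any of the analyzers discussed here.

```python
import math


def scaled_noise_floor_dbm(ref_floor_dbm, ref_rbw_hz, rbw_hz):
    """Noise floor at rbw_hz, assuming noise power is proportional to the
    resolution bandwidth (Eqn 1) and a known floor at a reference bandwidth."""
    # A ratio of powers r corresponds to 10*log10(r) dB.
    return ref_floor_dbm + 10.0 * math.log10(rbw_hz / ref_rbw_hz)


# Hypothetical reference point: -80 dBm at a 100 kHz resolution bandwidth.
floor_at_10khz = scaled_noise_floor_dbm(-80.0, 100e3, 10e3)
```

With these assumed numbers, a ten times narrower resolution bandwidth lowers the predicted noise floor by 10 dB.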

Dependence of detection range on resolution bandwidth:
Typically, the farther a transmitter is from a
spectrum analyzer, the lower the received power at the 
spectrum analyzer. The weaker the received signal, the 
lower the SNR and hence the less reliable its detection. 
Detection range of a spectrum analyzer at a certain res- 
olution bandwidth is the farthest distance from which an 
ongoing transmission can be detected reliably. 

Path loss models such as the Log Distance Path Loss 
(LDPL) model are typically used to estimate received 
power as a function of distance. The received power $P_r$
at a distance $d$ from a transmitter transmitting with power
$P_0$ based on the LDPL model is given by


$$P_r = P_0 - 10\gamma \log(d) + L \qquad (2)$$


In Eqn 2, $\gamma$ (usually between 2 and 3 for outdoor environments)
is the path loss exponent and $L$ dB (usually
modeled as a Gaussian with standard deviation between 
5-10 dB for outdoor environments) is a random variable 
that captures variations in the signal due to fading effects. 
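
A minimal sketch of the LDPL model of Eqn 2 follows. It assumes base-10 logarithms, distances in meters relative to a 1 m reference, and powers in dBm; the function name and the example values are illustrative only.

```python
import math
import random


def ldpl_received_power_dbm(p0_dbm, gamma, d_m, fading_sigma_db=0.0, rng=None):
    """Eqn 2: P_r = P_0 - 10*gamma*log10(d) + L.

    L is drawn from a zero-mean Gaussian with standard deviation
    fading_sigma_db; with the default of 0 the fading term is omitted.
    """
    fading = 0.0
    if fading_sigma_db > 0.0:
        fading = (rng or random).gauss(0.0, fading_sigma_db)
    return p0_dbm - 10.0 * gamma * math.log10(d_m) + fading


# Assumed example: -35 dBm at 1 m, path loss exponent 3, receiver 100 m away.
p_r = ldpl_received_power_dbm(-35.0, 3.0, 100.0)
```

With a path loss exponent of 3, every tenfold increase in distance costs 30 dB, so the assumed transmitter is received at -95 dBm at 100 m when fading is ignored.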

If $\Delta$ is the minimum SNR required for reliable occupancy
detection using a certain detection scheme, then in
order to detect a transmission from a distance $d$, the noise
floor must be $\Delta$ dB less than $P_r$, i.e., at most
$P_0 - 10\gamma \log(d) - \Delta$. Since noise floor is dictated by the
resolution bandwidth (Eqn 1), this in turn implies that one must choose
a lower resolution bandwidth to reliably detect a transmitter
that is farther away from the spectrum analyzer.
The dependence of detection range $d$ on resolution bandwidth $\rho$
can be derived from Eqn 1 (after converting from
dB) as


$$\rho \propto \frac{10^{(P_0 - \Delta)/10}}{d^{\gamma}} \qquad (3)$$

Eqn 3 indicates an important aspect of detecting transmissions
from a distance, namely, the maximum usable
resolution bandwidth decreases super-linearly (as $d^{-\gamma}$)
with detection range.
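
The conversion from dB behind Eqn 3 can be spelled out as follows. The sketch writes $\Delta$ for the minimum required SNR, $\gamma$ for the path loss exponent and $\rho$ for the resolution bandwidth, and it assumes base-10 logarithms, powers in dBm and no fading term.

```latex
\begin{align*}
N_{\mathrm{dBm}} &\le P_0 - 10\gamma\log_{10}(d) - \Delta
  && \text{(noise floor at least $\Delta$ dB below the received power)} \\
N &\le 10^{(P_0 - \Delta)/10}\, d^{-\gamma}
  && \text{(divide by 10 and exponentiate, giving mW)} \\
\rho_{\max} &\propto 10^{(P_0 - \Delta)/10}\, d^{-\gamma}
  && \text{(since $N \propto \rho$ by Eqn 1)}
\end{align*}
```

For instance, with $\gamma = 3$, doubling the detection range reduces the maximum usable resolution bandwidth by a factor of $2^3 = 8$.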


4 SpecNet Architecture 

SpecNet is a shared infrastructure consisting of geo- 
distributed, networked, programmable spectrum analyz- 
ers that are contributed and used by the community. The 





NSDI '11: 8th USENIX Symposium on Networked Systems Design and Implementation


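[Figure 3 panel labels: Master Server, Slave Servers. The figure also shows the following example client code.]

import xmlrpclib
# Connect to the XML-RPC endpoint of the SpecNet master server.
APIServer = xmlrpclib.ServerProxy("http://bit.ly/SpecNetAPI", allow_none=True)
# Both arguments of GetDevices are optional; None leaves them unspecified.
devices = APIServer.GetDevices(None, None)

Figure 3: SpecNet Architecture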

following two goals drive the design of SpecNet. 1) Ease
of Use: We expect SpecNet to support the needs of three 
different classes of users. First, sophisticated users such 
as whitespace researchers will likely need real-time, low- 
level access to the full functionality of the spectrum an- 
alyzers. Second, some users such as spectrum policy re- 
searchers may simply need access to the data collected 
by the spectrum analyzers. Finally, users such as sec- 
ondary network service providers or government person- 
nel interested in spectrum monitoring may require high- 
level APIs that abstract the details/complexity of Spec- 
Net and provide services such as tower localization or 
spectrum occupancy detection. 2) Efficiency: Given that 
spectrum analyzers are expensive ($10-40K) and may be
available to SpecNet for a limited duration, it is important
that the usage of spectrum analyzers be optimized where 
possible. Since the spectrum analyzers cannot be arbi- 
trarily “time-sliced” for fine-grained sharing, optimiza- 
tion requires completing each task as efficiently as pos- 
sible. We now present an overview of the SpecNet archi- 
tecture. 


4.1 Overview 


The SpecNet architecture is shown in Figure 3. It 
contains three key components: users or clients, slave 
servers that comprise laptops/PCs connected to spec- 
trum analyzers, and master servers that manage the slave 
servers. The typical work-flow is as follows: clients sub- 
mit jobs to the master servers; the master servers trans- 
late these jobs into spectrum analyzer commands based 
on Standard Commands for Programmable Instruments 
(SCPI) [14]. The master server also schedules these at 
the appropriate slave server nodes for execution at the de- 
sired/available time. The output of the commands is then 
either forwarded immediately to the client or the client is 
notified of when/where the output data from the submit- 
ted job would be available. 

XML-RPC: In order to support a wide range of client 
platforms, the SpecNet service is exposed by the master 
servers as XML-RPC calls, i.e., remote procedure calls 
that are encoded in XML and transported over HTTP 
using the XML-RPC standard. This allows clients to 
post jobs using the SpecNet APIs from any Internet-connected
node, with client code written in any language of their choice.
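
To make the encoding concrete, the sketch below serializes a GetDevices call into an XML-RPC request body and decodes it again, without contacting any server. It uses Python 3's standard `xmlrpc.client` module (the Python 3 name of `xmlrpclib`); the choice of passing two `None` arguments mirrors the client snippet in Figure 3, and nothing here reflects the actual SpecNet server implementation.

```python
import xmlrpc.client

# Encode GetDevices(None, None) as an XML-RPC request body.
# allow_none=True is required because None is not part of core XML-RPC;
# it is transmitted with the <nil/> extension element.
payload = xmlrpc.client.dumps((None, None), methodname="GetDevices",
                              allow_none=True)

# Decode the request again, as a server-side library would.
params, method = xmlrpc.client.loads(payload)
```

The `payload` string is what would travel in the body of the HTTP POST, which is why any language with an HTTP client and an XML parser can act as a SpecNet client.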

Push-vs-Pull: The jobs posted to the master server 
can either be pushed to or pulled by the slave servers. 
While a pull-based publish-subscribe model is less com- 
plex in terms of state maintenance at the server, it is not 
suitable for SpecNet users who may want to execute jobs 
with inter-dependent API calls that require reaction at 
sub-second intervals (see the Spectrum Cop application 
in Section 7.3). We thus adopt a push-based model where 
a persistent TCP connection is maintained between the 
slave servers and the master servers and jobs are pushed 
to the slave servers. 

Registration: Users contributing slave servers need 
to first register with the SpecNet master server. They 
may specify times during which the nodes are available 
to SpecNet. Upon completion of registration, a simple 
daemon is downloaded and executes on the slave server. 
This software establishes an outbound persistent TCP 
connection to the master server and another connection 
to the spectrum analyzer, thereby serving as a bridge be- 
tween the master server and the spectrum analyzer. 

Benchmarking: The master server first runs a suite 
of experiments to benchmark the fundamental character- 
istics such as noise floor and scan times of each spec- 
trum analyzer (details in [8]). This benchmarking helps 
the master server efficiently schedule jobs at the slave 
server nodes. Benchmarking is also needed to abstract
some of the low-level details of the spectrum analyzer
behind higher-level APIs and thereby mask
some of the heterogeneity among spectrum analyzers.
We discuss this next. 


4.2 APIs 


As mentioned earlier, SpecNet is designed to support 
three classes of users. Table 1 lists a subset of the APIs
supported by SpecNet. 

For sophisticated users who require low-level access 
to the spectrum analyzer, SpecNet has a reservation API 
that users can use to reserve a block of time on the de- 
sired slave servers. The users can then issue their de- 
sired low-level commands, which are simply forwarded 
through the master server to the slave servers for execu- 
tion. 

For policy users and others who are interested 
mainly in spectrum usage data, possibly for longitu- 
dinal studies, SpecNet schedules up to 10% of the 
available time at each slave server for itself. Dur- 
ing this time, the server performs a high resolution 
scan of the entire spectrum, stores this data in a 
SQL database and exposes this data to users through 
APIs such as GetPowerSpectrumHistory() or 
GetOccupancyHistory(). This stored data can 
also serve as a cache and may help respond (partly) to
other submitted jobs.

The interesting challenges in SpecNet’s design arise 
mainly in supporting the third class of users (e.g., net- 
work operators). These users may require support for 
high-level APIs that abstract out many of the details of 
using spectrum analyzers. While we have designed a 
few of these APIs (6-9 in Table 1), we expect the set
of high-level APIs to expand over time based on interest 
and through community contributions. 

Localization and Interpolation: Estimating the ge- 
ographical coverage of a primary transmitter is essential 
to creating a spectrum usage map. However, this requires 
knowledge of specifics of the transmitter such as its loca- 
tion and transmit power. Such information is usually not 
available or may be incorrect, especially in developing 
countries (Section 7.2). 

In order to localize transmitters, SpecNet provides 
the LocalizeTransmitter() API that uses signal 
strength observed at spectrum analyzers from various lo- 
cations but does not require input of parameters such as 
location and transmit power of the transmitter. Instead, 
SpecNet estimates these parameters that best explain the 
signal observations (in least mean square error terms) us- 
ing well known path loss models such as Longley-Rice 
or Log Distance Path Loss (LDPL). The number of un- 
knowns that can be estimated, however, fundamentally 
depends on the number of different locations from which 
signal strength was observed. In the case of the LDPL model
(Eqn 2), for example, if signal strengths from only three
locations are available, SpecNet sets $\gamma = 3$, takes the
transmit power ($P_0$) as input from the user and estimates
the location through triangulation. If signal strengths from
four different locations are available, SpecNet can estimate
$P_0$ and the transmitter location simultaneously by
fixing $\gamma = 3$. When observations from five or more
locations are available, SpecNet can simultaneously estimate
the transmitter location, transmit power $P_0$ and $\gamma$
that best fit the observations. Once the location of the
transmitter and other parameters are determined, con- 
structing a spectrum map is straightforward. SpecNet 
provides the FindPowerAtLocation() API that 
takes these parameters and predicts the likely received 
power at desired new locations (e.g., locations with no 
spectrum analyzer). 
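
The three-location case can be sketched as a least-squares fit. The Python code below assumes a known transmit power, a fixed path loss exponent of 3, a fading-free LDPL model with base-10 logarithms, and a brute-force grid search as the minimizer; the grid search, the function names and all coordinates are illustrative assumptions, not the solver SpecNet uses.

```python
import math


def ldpl_power_dbm(p0_dbm, gamma, tx, rx):
    """Fading-free LDPL prediction of the power received at rx from tx."""
    d = max(math.dist(tx, rx), 1.0)  # clamp to the 1 m reference distance
    return p0_dbm - 10.0 * gamma * math.log10(d)


def localize(observations, p0_dbm, gamma=3.0, extent_m=200, step_m=5):
    """Grid search for the position minimizing the squared prediction error.

    observations: list of ((x, y), measured_power_dbm) pairs.
    """
    best_pos, best_err = None, float("inf")
    for gx in range(0, extent_m + 1, step_m):
        for gy in range(0, extent_m + 1, step_m):
            err = sum((ldpl_power_dbm(p0_dbm, gamma, (gx, gy), rx) - p) ** 2
                      for rx, p in observations)
            if err < best_err:
                best_pos, best_err = (gx, gy), err
    return best_pos


# Synthetic, noise-free observations from three hypothetical analyzer sites.
true_tx = (60, 125)
sites = [(0, 0), (200, 0), (100, 200)]
obs = [(s, ldpl_power_dbm(-35.0, 3.0, true_tx, s)) for s in sites]
estimate = localize(obs, p0_dbm=-35.0)
```

Because the synthetic observations are noise-free and the true position lies on the search grid, the fit recovers it exactly; with real measurements the minimum would only approximate the transmitter location. Estimating $P_0$ or $\gamma$ as well adds dimensions to the same search, which is why more observation sites are then needed.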

Spectrum Occupancy Detection: The next two 
high-level APIs help users obtain spectrum occupancy 
at desired locations. GetPowerSpectrum() is
simply a spectrum scan over a given frequency range on a
given device, except that users do not even need to spec- 
ify the resolution bandwidth. Instead users can specify 
a region and desired minimum power level of transmit- 
ter to be detected. SpecNet then automatically chooses 
the best resolution bandwidth (based on the fundamental 
properties of occupancy detection discussed in Section 3) 
| No. | API | Description |
|---|---|---|
| | Low-level APIs (e.g., for sophisticated users) | |
| 1 | GetDevices([Boundary], [Timespan]) | Returns a list of spectrum analyzer IDs. Fewer/no arguments possible. |
| 2 | ReserveDevice(ID, Timespan) | Reserves and returns success, if available. |
| 3 | RunCommandOnDevice(ID, Command) | Issues SCPI command to device and returns result. |
| | Commands to access stored data (e.g., for policy users) | |
| 4 | GetPowerSpectrumHistory(ID, Fs, Fe, Timespan) | Returns (avg) power values from device for given time/frequency range (Fs-Fe). |
| 5 | GetOccupancyHistory(ID/Boundary, Fs, Fe, Timespan, Threshold) | Returns 0-1 list indicating occupancy in Fs-Fe at device or in region, based on threshold. |
| | High-level APIs (e.g., for operators or government users) | |
| 6 | LocalizeTransmitter(Boundary, Locations, Powers, Model, Parameters) | Localizes transmitter inside area, given observed power level(s) at location(s) using Model (LDPL, HATA, Longley-Rice, etc.). |
| 7 | FindPowerAtLocation(Location, [Transmitter Parameters], Model, [Model Parameters]) | Interpolates power at new location given transmitter location/parameters and model; useful for estimating coverage of transmitter. |
| 8 | GetPowerSpectrum(ID, Fs, Fe, [Boundary, P]) | Schedules a scan for given frequency range (SpecNet determines optimal resolution bandwidth) in order to detect minimum power level P in given area. |
| 9 | GetOccupancy(ID/Boundary, Fs, Fe, P) | Provides a 0-1 list corresponding to frequencies occupied at a device or region. P is the minimum transmitter power (SpecNet minimizes scan time). |

Table 1: Core APIs supported by SpecNet


and returns the results. The GetOccupancy() API goes
further by allowing the user to specify a region of interest 
for detecting occupancy of signals above a given thresh- 
old, without even identifying the desired slave server 
IDs. This API is useful for applications like Spectrum 
Cop (Section 7.3), which monitor unauthorized spectrum 
usage. To support this API, SpecNet computes the optimal
set of spectrum analyzers and their corresponding
resolution bandwidth values that minimize scan time and 
returns the results. Optimizing scan time across multiple 
spectrum analyzers is a challenging problem which we 
discuss next. 


5 Task Scheduling in SpecNet 


SpecNet allows users to deploy and execute spectrum 
sensing applications in real time. Users expect their sens- 
ing tasks to be dispatched and completed as soon as pos- 
sible. Consequently, SpecNet schedules participant spec- 
trum analyzers in a manner so as to minimize task com- 
pletion time. In this section we describe the challenges 
posed in the design of a task scheduler for SpecNet. 


5.1 Scanning Time of a Spectrum Analyzer 
For a spectrum analyzer, the time to perform a scan from
$f_{min}$ MHz to $f_{max}$ MHz depends on two parameters,
namely, the span $\Omega = f_{max} - f_{min}$ and the resolution
bandwidth $\rho$ used for the scan. Increasing the span requires
a spectrum analyzer to scan a larger part of the spectrum 
and consequently requires a longer scan time. Scanning 
at a smaller resolution bandwidth requires a larger num- 
ber of samples to be collected in order to reliably esti- 
mate the power in each of the finer frequency bins and 
hence, more time. For modern spectrum analyzers, the 
scan time may be modeled as 


$$T \propto \frac{\Omega}{\rho} \qquad (4)$$


In Eqn 4, $T$ is the scanning time, $\Omega$ the span and $\rho$ the resolution bandwidth. The proportionality
constant in Equation 4 can vary significantly across dif- 
ferent models of spectrum analyzers as discussed next. 
Theory versus Reality: Figure 4 depicts the scan times
measured from different spectrum analyzers at different
resolution bandwidths as a function of span. As seen
from Figure 4, the dependence of scanning time on span
($\Omega$) is strictly linear as dictated by Eqn 4. Consequently, it
is convenient to characterize scan times of spectrum analyzers
in terms of scan time per MHz, $\tau$. The scanning
time for a scan from $f_{min}$ to $f_{max}$ is then determined by
the product $(f_{max} - f_{min})\,\tau$.

Figure 5 depicts the measured scan times per MHz ($\tau$)
as a function of resolution bandwidth for three different 
models of spectrum analyzers in a log-log plot. Based on 
Eqn 4, the variation of scan times with resolution band- 
width should be linear. However, Figure 5 indicates sig- 
nificant departure from linearity. Rather the variation is 
piece-wise linear. For example, for FieldFox N9912A, 
the variation is linear in sections A-B and C-D sepa- 
rately. The piece-wise linearity arises because spectrum 
analyzers likely use different sets of circuits and modes 
for different ranges of resolution bandwidths and these 
circuits/modes presumably have different performance 
characteristics. To allow for these non-linearities, SpecNet
maintains lookup tables $\tau(\rho)$ describing the scanning
time per MHz for a given resolution bandwidth setting 
for each spectrum analyzer. 


5.1.1 Minimizing Scan Time by Automatic Resolution Bandwidth Selection


When scanning a part of the spectrum, users often care 
about having a low noise floor. The noise floor, however,
as discussed in Section 3, depends on the resolution
bandwidth chosen. SpecNet allows users to request
a scan by a remote spectrum analyzer by specifying the 
maximum tolerable noise floor. Behind the scenes, Spec- 
Net determines the resolution bandwidth that provides 
for the fastest scan time that satisfies the required noise 
floor. In order to enable such an API, the SpecNet server 
maintains lookup tables that provide scanning times per 
MHz at various resolution bandwidths, for each spectrum
analyzer connected to SpecNet.
Figure 4: Scanning time versus span

Figure 5: Scanning time per MHz versus resolution bandwidth


Dependence of Scanning Time on Detection Range: A 
greater detection range requires using a narrower resolu- 
tion bandwidth (Section 3). This in turn implies that to 
increase the detection range of a spectrum analyzer one 
must accept a longer scanning time. More specifically, 
from Equations 3 and 4, scanning time depends on detection
range as


$$T \propto 10^{-(P_0 - \Delta)/10}\, d^{\gamma}\, \Omega \qquad (5)$$


Eqn 5 reveals a crucial aspect of sensing, namely that scanning
time increases super-linearly with detection
distance and linearly with span. As described in
Section 5.2, SpecNet uses this dependence to efficiently 
share load among spectrum analyzers given a scanning 
task. 
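
The step from Equations 3 and 4 to Eqn 5 is a direct substitution, sketched below with $\Omega$ for the span, $\rho$ for the resolution bandwidth, $\Delta$ for the minimum required SNR and $\gamma$ for the path loss exponent.

```latex
\begin{align*}
T &\propto \frac{\Omega}{\rho}
  && \text{(Eqn 4)} \\
\rho &\propto 10^{(P_0 - \Delta)/10}\, d^{-\gamma}
  && \text{(Eqn 3)} \\
T &\propto \Omega\, d^{\gamma}\, 10^{-(P_0 - \Delta)/10}
  && \text{(substituting Eqn 3 into Eqn 4)}
\end{align*}
```

As a worked example under the idealized model, with $\gamma = 3$ a doubling of the detection range multiplies the scanning time by $2^3 = 8$, whereas a doubling of the span only doubles it.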

To account for the deviations in scanning time from
Equation 4 as depicted in Figure 5, given a detection
range $d$, instead of using Eqn 5, SpecNet uses the lookup
table $\tau(\rho)$ to determine the resolution bandwidth that has
the fastest scanning time per MHz while ensuring a noise
floor no higher than $P_0 - 10\gamma \log(d) - \Delta$. $P_0 = -50$
and $\Delta = 10$ are chosen as defaults unless specified by the
user, and $\gamma = 3$ is chosen as a conservative estimate.
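
The selection step can be sketched as a filter over a per-analyzer benchmark table followed by a minimum over scan time. The defaults below follow the values stated above, interpreted here as dBm and dB; the table layout, the function names and every table entry are hypothetical and were invented for illustration.

```python
import math


def target_noise_floor_dbm(d_m, p0_dbm=-50.0, delta_db=10.0, gamma=3.0):
    """Highest tolerable noise floor: P_0 - 10*gamma*log10(d) - Delta."""
    return p0_dbm - 10.0 * gamma * math.log10(d_m) - delta_db


def pick_rbw(table, max_floor_dbm):
    """Pick the fastest row of the table whose noise floor is low enough.

    table: list of (rbw_hz, noise_floor_dbm, scan_seconds_per_mhz) rows.
    Returns the chosen row, or None if no setting meets the requirement.
    """
    usable = [row for row in table if row[1] <= max_floor_dbm]
    return min(usable, key=lambda row: row[2]) if usable else None


# Hypothetical benchmark table for one analyzer. The scan times are
# deliberately non-monotonic to mimic the piece-wise behavior in Figure 5.
TABLE = [
    (1e6, -70.0, 0.001),
    (1e5, -80.0, 0.004),
    (1e4, -90.0, 0.050),
    (1e3, -100.0, 0.020),
    (1e2, -110.0, 0.900),
]

floor = target_noise_floor_dbm(10.0)  # -50 - 30 - 10 = -90 dBm
choice = pick_rbw(TABLE, floor)
```

In this invented table the 1 kHz setting is chosen over the 10 kHz setting even though both satisfy the noise floor, because it happens to scan faster; a purely formula-based choice would have picked the widest admissible bandwidth instead.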

Evaluation: Given a detection range, SpecNet 
chooses a resolution bandwidth so as to minimize scan- 
ning time. How well does the resolution bandwidth se- 
lection scheme work in practical deployments? A resolu- 
tion bandwidth chosen too low will take too long to scan 
while a resolution bandwidth chosen too high will not 
provide the necessary SNR to allow detection. There are 
several practical considerations. First, the path loss ex- 
ponent is not a fixed quantity and depends on the nature 
of the environment. Line of sight and non line of sight 
paths offer different path loss characteristics. Further, 
significant signal attenuation often occurs due to walls in 
indoor environments. 

To answer this question, we tested SpecNet in a real 
deployment at the Indian Institute of Science (IISC) cam- 
pus as depicted in Figure 7 on two different models of 
spectrum analyzer. The campus is lush with very dense
trees and this provided an excellent opportunity to evaluate
SpecNet in various scenarios such as Line of Sight
(LOS), Non-Line of Sight (NLOS) and Indoors. In Figure 7,
two different models of spectrum analyzer are located
at O, while a wireless microphone was placed at
six different locations, two each in the LOS, NLOS and
indoor categories. In each of the six detection experiments,
the detection range was set to the exact distance
between the microphone and the spectrum analyzer. $P_0$
was set to -35 dBm, which was determined by measuring
the power of the microphone at a distance of 1 m. For all
our experiments we fixed $\Delta = 10$ dB. In other words,
given a detection range, SpecNet must choose the res- 
olution bandwidth that provides the minimum scanning 
time while ensuring that the SNR is a minimum of 10 
dB. Table 8 provides a summary of the results. 


Line of Sight: As seen from Table 8, for both the LOS 
experiments and for both spectrum analyzers, SpecNet 
chose a very conservative noise floor: while the target
SNR is 10 dB, the observed SNR is about 25 dB. Figure 6
depicts the decay of signal strength with distance for the
microphone in line of sight. The path loss decay exponent
$\gamma$ was estimated to be around 2.5; however, SpecNet
conservatively chooses $\gamma = 3.0$ in estimating the target
noise floor. This results in the conservative choice of the 
resolution bandwidth. 

Non Line of Sight: For NLOS experiments, the reso- 
lution bandwidth choice of SpecNet allows for an SNR 
close to the target 10 dB for both spectrum analyzers, indicating
that $\gamma$ was closer to 3 for these experiments.
Indoors: When the microphone was kept indoors, how- 
ever, SpecNet finds itself underestimating the signal de- 
cay. For example, in both the experiments, the chosen 
resolution bandwidths allow only SNR of about 6 dB 
rather than 10 dB. 


Figure 6: Decay in received signal

Figure 7: Occupancy detection using a single spectrum analyzer

While choosing a conservative resolution bandwidth
ensures detection, it results in longer scanning times.
What is the loss in scanning time due to the conservative
choices of resolution bandwidth? To answer this
question, we attempted to detect the microphone at several
different resolution bandwidths without the use of
SpecNet’s resolution bandwidth selection. We then determined
the optimal resolution bandwidth for each experiment
that allowed an SNR of 10 dB. The table in Figure 8
shows the loss in scanning time in seconds due to the
sometimes conservative choice of SpecNet for each experiment.
As seen from the table, the loss in scanning time
is in the range of a few milliseconds most of the time
and up to a few seconds in some cases. Thus, we con- 
clude that the automatic resolution bandwidth estimation 
in SpecNet works as intended. 


5.2 Occupancy Detection 


In many practical applications of occupancy detection, 
users are interested in spectrum occupancy in a specific 
geographic region. For example, “are there any ongo- 
ing transmissions in the spectrum range 700 MHz to 800 
MHz within a 5 km radius of my location?” SpecNet 
allows users to specify a circular region defined by a
center and a radius for spectrum measurement. Behind 
the scenes, SpecNet determines the set of relevant spec- 
trum analyzers that can be used to accomplish this task. 
Any spectrum analyzer whose maximum detection range 
(determined by the lowest resolution bandwidth) over- 
laps with the user-specified region of interest is deemed 
relevant. When there are multiple relevant spectrum ana- 
lyzers, SpecNet schedules the scanning task load among 
them so as to minimize the overall scanning time. 


5.2.1 Load sharing across multiple spectrum ana- 
lyzers 


There are two distinct dimensions along which a scan- 
ning task can be shared among multiple spectrum ana- 
lyzers, namely, spectrum and geography. Spectral load 
sharing involves different spectrum analyzers scanning 
complementary parts of the spectrum while geographical 
load sharing involves different spectrum analyzers scan- 
ning different spatial sections of the overall geographi- 
cal area of interest. SpecNet uses a combination of both 
these techniques to minimize overall scanning time. 
Figure 9: Occupancy detection using two spectrum analyzers


The Scheduling Metric: If n different spectrum analyz- 
ers are scheduled to share a certain task load, they scan 
in parallel and accomplish their respective sub-tasks in 
parallel. Suppose that the $i$-th spectrum analyzer takes
time $T_i$ to complete its assigned sub-task. The task is
deemed complete when all spectrum analyzers have accomplished
their respective sub-tasks. Since all spectrum
analyzers are tasked in parallel, the time to task completion
is given by $T = \max(T_1, T_2, \ldots, T_n)$. The goal of
the SpecNet task scheduler is to minimize the task completion
time. Hence, SpecNet attempts to schedule various
spectrum analyzers in such a manner that the maximum
over all sub-task completion times is minimized,
i.e., in a min-max manner.

Spectral Load Sharing: Figure 9 depicts a circular 
region of interest and two spectrum analyzers S1 and 
S2 located at X1 and X2 that can potentially be used to 
scan the circular region of interest. Suppose that the user
needs to scan from $f_{min}$ MHz to $f_{max}$ MHz. S1 and S2
could then share the task such that S1 scans from $f_{min}$
MHz to $f_{min} + \Omega_1$ MHz, while S2 scans from $f_{min} + \Omega_1$
MHz to $f_{max}$ MHz. Such spectral load sharing results in a reduction
in span for the participant spectrum analyzers,
thus reducing the overall scanning time. 

In the above example, $\Omega_1$ must be chosen in a manner
so that the maximum of the scanning times of S1 and
S2 is minimized. In order to detect any transmission
in the entire region of interest, S1 must have a detection
range equal to $|X_1 O_1| = d_1$, where $O_1$ corresponds to the
farthest possible transmitter location within the region of
interest from S1 (as depicted in Figure 9). Similarly, the
detection range of S2 should be $|X_2 O_2| = d_2$ in order to
detect any transmitter in the region of interest. Let $\tau_i$ be
the minimum scanning time per MHz for spectrum analyzer
$S_i$ required to achieve a detection range of $d_i$. Then
the overall scanning time is given by $\max(\tau_1 \Omega_1, \tau_2 \Omega_2)$,
where $\Omega_2 = f_{max} - f_{min} - \Omega_1$. The optimal choice
then corresponds to when

$$\tau_1 \Omega_1 = \tau_2 \Omega_2 \qquad (6)$$

Eqn 6 can be easily generalized to spectral partitioning
for several spectrum analyzers. In case of several spec- 
trum analyzers, the span of spectrum allocated to each 
spectrum analyzer is inversely proportional to the mini- 
mum scanning time per MHz required to scan the circu- 
lar region of interest. 
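
The generalized rule, spans inversely proportional to each analyzer's scanning time per MHz, can be sketched in a few lines of Python. The function name and the example per-MHz times are assumptions made for illustration.

```python
def split_span(f_min_mhz, f_max_mhz, taus):
    """Split [f_min, f_max] into contiguous bands, one per analyzer, so that
    tau_i * span_i is the same for every analyzer.

    taus: minimum scan time per MHz of each analyzer for the required range.
    Returns a list of (start_mhz, end_mhz) tuples in analyzer order.
    """
    total = f_max_mhz - f_min_mhz
    weights = [1.0 / t for t in taus]  # span_i is proportional to 1/tau_i
    scale = total / sum(weights)
    bands, start = [], f_min_mhz
    for w in weights:
        end = start + w * scale
        bands.append((start, end))
        start = end
    return bands


# Hypothetical analyzers needing 0.2 s/MHz and 0.6 s/MHz for the region.
bands = split_span(700.0, 800.0, [0.2, 0.6])
times = [t * (end - start) for t, (start, end) in zip([0.2, 0.6], bands)]
```

In this example the faster analyzer receives 75 MHz and the slower one 25 MHz, so both finish in about 15 s, compared with 20 s if the faster analyzer scanned the whole 100 MHz alone.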

Geographical Load Sharing: Another way to share 
the load between S1 and S2 (Figure 9) is to partition 
the region of interest geographically by requiring them 
to scan only parts of the region of interest rather than 
the entire region. In Figure 9, the region is divided into
two sections by the line $O'_1 O'_2$. S1 and S2 are each
responsible for scanning one of the two sections. The advantage
of partitioning in this manner is that individual spectrum
analyzers can now use a smaller detection range. As
seen in Figure 9, S1 and S2 use detection ranges equal to
$|X_1 O'_1| = d'_1 < d_1$ and $|X_2 O'_2| = d'_2 < d_2$ respectively.
As described in Equation 5, reduced detection range implies
reduced scanning time. Thus, each of the spectrum
analyzers takes a shorter time to scan its respective
region, reducing overall task completion time.

Since every spectrum analyzer scans a different geographical
region, each must scan the entire spectrum of
interest, $f_{min}$ to $f_{max}$. If the scanning times per MHz
of $n$ geographically task-sharing spectrum analyzers are
given by $\tau_1, \tau_2, \ldots, \tau_n$, then the overall task completion
time will be $\max(\Omega\tau_1, \Omega\tau_2, \ldots, \Omega\tau_n)$. Consequently,
in order to minimize overall task completion time, we
need $\tau_i = \tau$ for all $i$ such that $\tau$ is minimized while ensuring
that the entire area of interest is covered.

First consider the case of homogeneous spectrum analyzers.
Ensuring equal $\tau_i$ translates to ensuring equal
maximum detection ranges for all the spectrum analyzers.
This problem can be optimally solved using Voronoi par- 
titioning with each spectrum analyzer being treated as a 
Voronoi site. Each Voronoi cell, then, would correspond 
to the geographical region assigned to the spectrum an- 
alyzer. The resolution bandwidth of each spectrum an- 
alyzer would correspond to the detection range required 
to accommodate the farthest point in its Voronoi cell. 

Now consider the case of heterogeneous spectrum analyzers.
Since the scanning times of different analyzers
are different, standard Voronoi partitioning is no longer
optimal. Instead, the SpecNet scheduler performs a modified
version of Voronoi partitioning, equal detection
time partitioning, where proximity is measured in terms
of detection time rather than Euclidean distance.

Given the non-linear and discontinuous nature of de- 
pendence of detection time on detection range (Equa- 
tion 5), to the best of our knowledge there exists no 
known exact solution to this partitioning problem. Con- 
sequently we resort to solving the problem numerically. 
The entire area of interest is sampled at several locations 
generated randomly over the area of interest. Each random
location is then assigned to its nearest spectrum analyzer
in terms of the scan time required to detect a transmitter
at that location. Note that if a point is located
beyond the detection range of a spectrum analyzer, the 
corresponding scanning time is set to infinity. Finally, 
each spectrum analyzer is assigned a resolution band- 
width by setting its detection range to the farthest ran- 
dom location assigned to it. The run-time complexity of 
this numerical scheme depends on the number of random 
points chosen. In our implementation we generated random
locations with a density of 1 location per square meter.
For an area of 1 sq km ($1 \times 10^6$ random locations) we
found that geographic partitioning took under a few hundred
milliseconds on the SpecNet server.
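
The numerical scheme can be sketched as follows. The per-analyzer table mapping a detection range to a scan time per MHz, the tie-breaking by Euclidean distance, the sample count and all coordinates are assumptions made for this illustration and do not come from the SpecNet implementation.

```python
import math
import random


def scan_time_per_mhz(analyzer, d_m):
    """Smallest scan time per MHz that reaches distance d_m, or infinity."""
    # analyzer["ranges"]: hypothetical (max_range_m, seconds_per_mhz) pairs.
    times = [t for reach, t in analyzer["ranges"] if reach >= d_m]
    return min(times) if times else math.inf


def partition(analyzers, center, radius_m, n_points, seed=0):
    """Assign random points of a circular region to the analyzer that can
    detect them fastest. Returns (required_range_per_analyzer, uncovered)."""
    rng = random.Random(seed)
    farthest = [0.0] * len(analyzers)
    uncovered = 0
    for _ in range(n_points):
        # Draw a point uniformly at random inside the circle.
        r = radius_m * math.sqrt(rng.random())
        a = 2.0 * math.pi * rng.random()
        p = (center[0] + r * math.cos(a), center[1] + r * math.sin(a))
        dists = [math.dist(p, an["pos"]) for an in analyzers]
        costs = [scan_time_per_mhz(an, d) for an, d in zip(analyzers, dists)]
        # Nearest in scan time; ties are broken by distance (an assumption).
        best = min(range(len(analyzers)), key=lambda i: (costs[i], dists[i]))
        if math.isinf(costs[best]):
            uncovered += 1  # no analyzer can detect a transmitter here
            continue
        farthest[best] = max(farthest[best], dists[best])
    return farthest, uncovered


# Two identical hypothetical analyzers, 100 m apart, and a 50 m radius region.
RANGES = [(30.0, 0.01), (60.0, 0.05), (120.0, 0.5)]
analyzers = [{"pos": (0.0, 0.0), "ranges": RANGES},
             {"pos": (100.0, 0.0), "ranges": RANGES}]
farthest, uncovered = partition(analyzers, (50.0, 0.0), 50.0, 2000)
```

Each entry of `farthest` is the detection range the corresponding analyzer must support, from which its resolution bandwidth follows. For identical analyzers the assignment reduces to ordinary Voronoi partitioning, so in this example neither analyzer needs a range beyond about 71 m, compared with 100 m if one analyzer covered the whole circle.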


5.2.2 Geographical versus Spectral Load Sharing 


Which of the above two load-sharing schemes should 
we use and under what circumstances? To answer this 
question we describe the results of two experiments con- 
ducted in the Indian Institute of Science (IISC) campus, 
depicted in Figures 10a and 10b, scanning from 700-800 
MHz. In each of the experiments we compared three 
different scheduling methods. In Best Select, the spec- 
trum analyzer that can accomplish the task in the shortest 
time is selected and used to accomplish the scanning task 
without any load sharing. We compared Best Select with 
spectral and geographical load sharing. 

Experiment I: Two identical spectrum analyzers (both 
N9320B Agilent models) were placed 103 m apart at A 
and B as depicted in Figure 10a. The region of interest 
was specified as a circle of radius 50 m. 

Experiment II : Two identical spectrum analyzers (both 
N9320B Agilent models) were both placed at location 
A and the region of interest was specified as a circle of 
radius 50 m as shown in Figure 10b. 


[Table 2 columns: Experiment, Best Select, Spectral, Geographical. Rows: Experiment I, Experiment II. The cell values are illegible in the source.]

Table 2: Comparison of load sharing schemes


Results of Experiment I: As depicted in Table 2, since 
the spectrum analyzers are identical, the optimal spec- 
tral load sharing resulted in both the spectrum analyz- 
ers taking an almost equal amount of time (in practice 
a slight difference in their noise floors resulted in one 
spectrum analyzer scanning a bit more spectrum than 
the other). Consequently, spectral partitioning completed 
about twice as fast as Best Select. Curiously, geographi- 
cal load sharing completed almost five times faster than 
spectral load sharing. In this particular experiment, 
Voronoi partitioning resulted in two halves of the circle 
indicated by regions R1 and R2 in Figure 10a. Consequently,
the detection range required for each of the spectrum
analyzers in geographical load sharing was smaller than
that required in spectral load sharing. Eqn 5 reveals that
scanning time decreases super-linearly as detection range
decreases, explaining the 5x gains.

Figure 10: Comparison of scheduling schemes. (a) Experiment I, (b) Experiment II, (c) Experiment III.

Results of Experiment II: As depicted in Table 2,
since the spectrum analyzers are co-located and identi- 
cal, optimal spectral load sharing assigns two halves of 
the span to each spectrum analyzer. Consequently, spec- 
tral load sharing performs approximately twice as well 
as scheduling without load sharing. Here, however, geo- 
graphical load sharing performs exactly the same as hav- 
ing no load sharing and takes twice as long as spectral 
partitioning! The Voronoi partition for the experiment 
is indicated by the dashed line separating R1 and R2 in 
Figure 10b. The maximum detection range required by 
each of the two spectrum analyzers to cover their respec- 
tive partitions is actually almost the same as that required 
to cover the entire circular region of interest. Since both 
the spectrum analyzers scan the entire spectrum, one of 
the spectrum analyzers is actually redundant. This ex- 
periment shows that when spectrum analyzers are very 
closely located, spectral partitioning can be more advan- 
tageous than geographical partitioning. 


5.2.3 Geo-Spectral Load Sharing


Spectral and geographical task sharing, as described in 
Section 5.2.1, each optimize along a single dimension 
only, namely either frequency (spectral) or area (geo- 
graphical). As seen from Experiments I and II (Sec- 
tion 5.2.2), while geographical task sharing may be su- 
perior to spectral in some scenarios, the opposite may be 
true in others. A more general task partitioning scheme,
then, is geo-spectral partitioning, where optimization is
performed simultaneously along both the spectral and
geographic dimensions.

Optimal geo-spectral task sharing, where spectrum
analyzers are assigned a combination of frequency range
and geographical area to minimize overall task completion
time while ensuring that the entire area and spectrum of
interest are covered, falls under a class of non-convex
optimization problems for which, to the best of
our knowledge, there exists no known exact solution.
However, Experiments I and II (Section 5.2.2) reveal 
two key observations that allow us to develop a heuristic 
to enable geo-spectral task sharing. First, geographical 
partitioning typically out-performs spectral partitioning 
owing to the super-linear relationship between detection 
range and scanning time. Second, when spectrum ana- 
lyzers are located near each other, spectral partitioning 
tends to outperform geographical partitioning. 


In order to facilitate explanation of our heuristic for 
geo-spectral task sharing, we introduce the notion of a 
spectrally sharing cluster (SSC) of spectrum analyzers — 
a set of spectrum analyzers that share their scanning tasks 
spectrally over the same geographical region (possibly 
over only a small part of the entire region of interest). An 
SSC can be replaced by a single representative Virtual 
Spectrum Analyzer (VSA). The distance of a location
from this VSA is then the maximum over the distances
to all spectrum analyzers in the corresponding SSC, since
even the farthest constituent spectrum analyzer must
detect occupancy at this location. The occupancy detection
time for any location using the VSA is determined by op- 
timally partitioning the spectrum among the constituent 
spectrum analyzers in the corresponding SSC (as de- 
scribed in Section 5.2.1). The union of two SSCs yields 
a VSA comprising the union of all constituent spectrum 
analyzers in both SSCs. 
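The VSA definition above can be sketched in a few lines. This is only an illustration under stated assumptions: positions are flat (x, y) coordinates in meters, an SSC is modeled as a frozenset of analyzer positions, and the function names are hypothetical rather than part of SpecNet.

```python
import math


def vsa_distance(location, ssc_positions):
    """Distance from a location to the VSA that represents an SSC.

    Every analyzer in the SSC must detect occupancy at the location,
    so the farthest constituent analyzer determines the distance.
    """
    return max(math.dist(location, p) for p in ssc_positions)


def ssc_union(ssc_a, ssc_b):
    """The union of two SSCs contains all of their constituent analyzers."""
    return ssc_a | ssc_b


ssc_1 = frozenset([(0.0, 0.0)])
ssc_2 = frozenset([(30.0, 40.0)])
merged = ssc_union(ssc_1, ssc_2)
distance = vsa_distance((0.0, 0.0), merged)  # farthest analyzer is 50 m away
```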


Model                      Frequency Range     RBW steps
Agilent N9320B             9 kHz - 3 GHz       11 (10 Hz - 1 MHz)
Agilent Fieldfox N9912A    5 kHz - 6 GHz       36 (10 Hz - 1 MHz)
Agilent EXA N9010A         9 kHz - 26.5 GHz    62 (1 Hz - 8 MHz)
Agilent PSA E4440A         3 Hz - 26.5 GHz     68 (1 Hz - 8 MHz)
Hewlett-Packard E4403B     9 kHz - 3 GHz       15 (10 Hz - 5 MHz)

Table 3: Spectrum analyzer models used in SpecNet


Our geo-spectral task sharing heuristic for n spectrum
analyzers is initialized by creating n SSCs, each
comprising a single distinct spectrum analyzer, and
performing geographical task sharing on them. The algorithm
is a greedy iterative scheme: at each step, pairwise SSC
unions are considered to determine whether overall task
completion time can be reduced. To determine overall task
completion time for a given set of SSCs, each SSC is
replaced by its corresponding VSA and geographical
partitioning is performed on this set of VSAs. The SSC pair
union that yields the maximum reduction in overall task
completion time is accepted for the next iterative step.
The procedure continues until no remaining union of SSCs
reduces the overall task completion time. In the worst
case, the algorithm terminates in n steps, since at each
step the number of SSCs decreases by 1. Because all pairs
of SSCs must be explored at each step, the worst-case
running time of this algorithm is O(n^3). Since spectral
sharing typically yields benefits only when two spectrum
analyzers are "close", in practice the running time can be
reduced to O(n^2) by considering a fixed number of closest
SSCs rather than all possible SSC pairs at each step.
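The greedy merging loop described above can be sketched as follows. This is a minimal illustration, not the SpecNet code: the completion-time evaluation (VSA replacement followed by geographical partitioning) is abstracted into a caller-supplied function, and the toy cost table uses made-up numbers purely to exercise the loop.

```python
from itertools import combinations


def greedy_geo_spectral(analyzers, completion_time):
    """Greedily merge spectrally sharing clusters (SSCs).

    completion_time(clusters) is a caller-supplied function (hypothetical
    here) that returns the overall task completion time after replacing
    each SSC by its VSA and partitioning the region geographically.
    """
    clusters = [frozenset([a]) for a in analyzers]  # one SSC per analyzer
    best_time = completion_time(clusters)
    while len(clusters) > 1:
        best_merge = None
        for x, y in combinations(clusters, 2):
            merged = [c for c in clusters if c not in (x, y)] + [x | y]
            t = completion_time(merged)
            if t < best_time:  # keep the union with the largest reduction
                best_time, best_merge = t, merged
        if best_merge is None:  # no union reduces completion time
            break
        clusters = best_merge
    return clusters, best_time


# Toy cost model with made-up times (seconds) per cluster; the overall
# time is that of the slowest cluster. Unlisted clusters are penalized.
TOY_TIMES = {
    frozenset(["S1"]): 1200.0,
    frozenset(["S2"]): 1100.0,
    frozenset(["S3"]): 500.0,
    frozenset(["S1", "S2"]): 520.0,
}


def toy_completion_time(clusters):
    return max(TOY_TIMES.get(c, 2000.0) for c in clusters)


clusters, total = greedy_geo_spectral(["S1", "S2", "S3"], toy_completion_time)
```

With this toy cost model the loop merges S1 and S2 into one SSC and leaves S3 on its own. Restricting the inner loop to a fixed number of nearby SSCs would give the reduced running time mentioned in the text.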

Figure 10c depicts an example of Geo-Spectral load 
sharing. The scanning frequency range was chosen as 
700 MHz to 800 MHz. Spectrum analyzers S1, S2 and 
S3 are located at A, B and C respectively. S3 (Fieldfox) 
is a much faster spectrum analyzer compared to S1 and 
S2 (both N9320B Agilent). The circular region of in- 
terest is geographically partitioned into two regions R1 
and R2. S1 and S2 scan region R1 using spectral load
sharing while S3 scans the entire spectrum in geographic 
region R2. To compare the performance of geo-spectral 
partitioning we also tried scheduling using the purely ge- 
ographic and spectral schemes. Geographic load sharing 
took 1205 seconds; spectral load sharing 1118 seconds; 
and geo-spectral load sharing only 526 seconds. 

In summary, load sharing across multiple spectrum 
analyzers is a challenging problem. SpecNet’s Geo- 
Spectral load sharing algorithm is able to achieve 2-5X 
speedup compared to using a single spectrum analyzer in 
our experiments. 


6 Implementation 

The SpecNet platform is accessible at [15] via a web ser- 
vice API. It consists of a master server that manages sev- 
eral slave servers. 


6.1 Master Server 
The master server performs two major functions: first, it
exposes an API (Section 4), which SpecNet clients/users
utilize to write programs, and second, it manages all the
slave servers connected to it.

As mentioned in Section 4, the API is exposed as
XML-RPC calls to allow access from a wide range of
platforms. The master server implements a push-based
model; thus, TCP connections to the slave servers are
kept persistent using heartbeats. The current implemen- 
tation of the master server is centralized and consists of 
approximately 5000 lines of C# code. However, parti- 
tioning of the slave servers along geographic boundaries 
is possible, thus allowing distributed execution across 
multiple master servers if scalability concerns arise. 

One of the key challenges in managing slave servers 
is dealing with the heterogeneity of spectrum analyz- 
ers. As shown in Table 3, spectrum analyzers differ 
in their supported resolution bandwidth steps and fre- 
quency range of operation. Further, as discussed earlier, 
scan times (Figure 5) and noise floor (Figure 2) also vary 
across spectrum analyzers. SpecNet accounts for each of 
the above variations through a novel, automatic remote 
benchmarking process, described in detail in [8], that al- 
lows the master server to quickly build up a lookup table 
of scan times and noise floor values at different resolu- 
tion bandwidth steps for each of its slave servers. 


6.2 Slave Servers 

The slave server is a small piece of software that runs
on a desktop or laptop directly connected to the
spectrum analyzer. The main task of the slave server is to
act as a bridge between the spectrum analyzer connected
to it and the master server. To avoid issues with NATs and
firewalls, the slave server initiates an outbound TCP
connection on port 22 to the master server. It also connects
to the local spectrum analyzer through VISA. Once
connected, it translates commands from the master server
into spectrum-analyzer-specific commands, runs spectrum
scans, and returns the results.

In order to support multiple platforms, we have im- 
plemented the slave server in Python in approximately 
1000 lines of code. We use the PyInstaller [13] package
to generate platform-specific executables (Windows and
Linux as of today).


7 Applications 
In this section, we present three example user applica- 
tions on the SpecNet platform that highlight the simplic- 
ity of building a networked, geo-distributed system of 
spectrum analyzers. 


7.1 Remote Spectrum Measurement 


In this section we demonstrate how SpecNet can be used 
to make spectrum measurements anywhere in the world. 
The user code fragment written in Python is shown in 
Listing 1. One simply needs to connect to the SpecNet 
server, identify available devices in the region of interest 
and then use the GetPowerSpectrum() API to obtain
power values in the desired parts of the spectrum.
This data can be used, for example, to compare available
free spectrum in different parts of the world or as
traces for evaluation of new white-space protocols such
as WhiteFi [3].

Figure 11: Spectrum occupancy in various geographic regions


Listing 1: Code snippet for remote measurement. 


# connect to SpecNet server
import xmlrpclib

APIServer = xmlrpclib.ServerProxy(
    "http://bit.ly/SpecNetAPI",
    allow_none=True)

# Find devices from region of interest
# (bound given as [latitude, longitude, radius])
devices = APIServer.GetDevices(
    [55.944350, -3.187745, 500.0], None)

# Fs, Fe: start and end of the frequency range to scan,
# set by the user before this point
for device in devices:
    power_vals = APIServer.GetPowerSpectrum(
        device['ID'], Fs, Fe, 1e3)


At the time of writing, in addition to a few spectrum 
analyzers in Bangalore (India), we had one spectrum an- 
alyzer in Stony Brook (USA) and one in Edinburgh (UK) 
that were connected to SpecNet. Figure 11 shows the 
spectrum measurements at these three sites located in 
three different continents, demonstrating the world-wide 
reach of the SpecNet platform. As seen from Figure 11, 
spectrum measurements at each of these locations across 
the world clearly identify well-known transmitters
such as FM, TV, etc., and the available spectrum white
spaces.


7.2 Primary Coverage 


The next example application determines the spatial foot- 
print of a TV transmitter located within a large city. This 
may be useful for whitespace network operators in plan- 
ning their deployments. Determining the footprint of a 
TV transmitter invariably requires knowledge of its lo- 
cation. While accurate databases of these locations are 
available in countries such as the US, such a database 
is not readily available in many developing countries, in- 
cluding India. We tried to obtain this information by con- 
tacting the Indian government agencies via postal mail 
(under the Right-to-Information Act). While we received 
information on about 150 TV tower locations (out of an 
estimated 700 towers), we found many inaccuracies in 
the data. For example, one tower’s location was mapped 
well into a bay! Upon analyzing this TV tower data for 
five cities (ground truth based on Wikimapia), we found
localization errors to range between 2-83 km (average 
22 km, median 5 km). We now highlight how SpecNet 
could be used as a low-cost solution to improve the cov- 
erage and accuracy of the existing TV tower database. 


Listing 2: Code snippet for primary coverage. 


from numpy import average

# Get Spectrum Analyzers in region
area_of_interest = [13.02236, 77.56558, 100000.0]
devices = APIServer.GetDevices(area_of_interest, None)

# Get Power Spectrum Values (Fs, Fe: user-defined frequency range)
power_values = []
observation_locations = []
for device in devices:
    power_vals = APIServer.GetPowerSpectrum(
        device['ID'], Fs, Fe, 1e3)
    # One averaged power value per observation location
    power_values.append(average(power_vals))
    observation_locations.append([device['latitude'],
                                  device['longitude']])

# Localize
if len(observation_locations) < 5:
    localization_res = APIServer.LocalizeTransmitter(
        area_of_interest, observation_locations,
        power_values, 'LDPL', [-35.0, 3.0])
else:
    localization_res = APIServer.LocalizeTransmitter(
        area_of_interest, observation_locations,
        power_values, 'LDPL', None)

# Interpolate (new_location: user-defined point of interest)
power = APIServer.FindPowerAtLocation(new_location,
    [localization_res], 'LDPL', None)


The code snippet for this application is shown in List- 
ing 2. The region of interest is identified and power 
spectrum values from devices in that region are ob- 
tained. Then the TV transmitter is localized using the 
LocalizeTransmitter() API. Finally, a path loss 
model is used to build the spatial footprint of the TV 
transmitter. The API FindPowerAtLocation() is 
then used to determine the received power at desired new 
locations. 
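As a rough illustration of the path loss step, the standard log-distance path loss (LDPL) model predicts received power from a reference power and a path loss exponent. The sketch below is not the SpecNet API: the function name, the 1 m reference distance, and the example values (a reference power of -35 dBm and an exponent of 3.0) are assumptions chosen only for illustration.

```python
import math


def ldpl_received_power(p0_dbm, exponent, distance_m, d0_m=1.0):
    """Log-distance path loss: P(d) = P0 - 10 * n * log10(d / d0).

    p0_dbm is the power at the reference distance d0_m, and exponent
    is the path loss exponent n.
    """
    return p0_dbm - 10.0 * exponent * math.log10(distance_m / d0_m)


# Illustrative values only: each tenfold increase in distance lowers
# the predicted power by 10 * n = 30 dB.
at_10_m = ldpl_received_power(-35.0, 3.0, 10.0)
at_100_m = ldpl_received_power(-35.0, 3.0, 100.0)
```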

Bangalore city has one terrestrial TV transmitter. For 
the purpose of evaluation in a large-scale setting, we 
needed data from multiple spectrum analyzers at differ- 
ent locations in the city. Also, the accuracy of the lo- 
calization API depends on the number of measurement 
locations. However, at the time of evaluation we only 
had access to four slave servers inside Bangalore. To get 
around this problem, we modified the master server to
allow mobile slave servers to connect to it. This enabled 
us to gather data from multiple locations in the city us- 
ing just one mobile slave server by driving on the major 
roads and highways of the city. Figure 12 depicts the lo- 
cations in the city where measurements were collected. 


Figure 13 shows the mean, 25th and 75th percentile of the
TV tower localization error (y-axis) as the number of
measurement locations is varied (x-axis). To generate each
point in Figure 13, twenty subsets of locations were
randomly picked from the set of all measurement locations.
We see that even when the number of measurement locations
is between 5 and 10, the mean localization error is
between 2.5 and 3.8 km. This demonstrates that even by
using measurements from a small number of spectrum
analyzers in each city, the gaps and inaccuracies in the
government database can be corrected significantly.¹ As the
number of measurement locations is increased to 100,
the localization error goes below 0.5 km.
While it is unrealistic to assume that SpecNet would have
over 100 spectrum analyzers in each city, an alternative
is to have mobile spectrum analyzers as part of
SpecNet; we plan to look into this in the future.

Figure 14 shows the mean, 25th and 75th percentile errors
in signal strength predictions obtained by using the
interpolation API. The mean signal error varies between
6 and 8 dB, similar in magnitude to the expected signal
variations due to the environment.² Thus, using SpecNet
to calculate coverage of a primary transmitter can
provide a good estimate to an operator.


7.3 SpectrumCop


Our final application demonstrates the two key features 
of SpecNet: 1) simplicity of writing a complex real-time 
application through the use of high-level APIs and 2) ef- 
ficiency of SpecNet in scanning a wide frequency range 
when more than one spectrum analyzer is available, in 
order to detect violators quickly. 

The goal of this application is to quickly detect a static 
narrow-band transmitter within a certain geographical re- 
gion of interest and then localize the transmitter. The 
transmitter can be operating anywhere within a wide fre- 
quency range. This application is especially useful for, 
say, government officials to monitor unauthorized trans- 
mitters in a certain band. 

The code snippet for this application is shown in
Listing 3. The application uses the GetOccupancy() API
for the transmitter detection part, which tasks
one or more spectrum analyzers in the vicinity to perform
scans at an appropriate resolution bandwidth and
frequency range. The result of this API call is an
occupancy list, which indicates frequencies that have ongoing
transmissions. A more detailed spectrum measurement
is then performed only in the region around the detected
frequency. The results of the scan are then fed to the
LocalizeTransmitter() API to determine the location of the
transmitter.

¹Note that we used basic triangulation to locate the TV tower; it
may be possible to achieve higher accuracy through more
sophisticated localization schemes proposed in the literature.

²In our implementation we used a simple log-distance path loss
model. The use of more sophisticated path loss models, such as those
that use terrain information, may provide more accurate predictions.


Listing 3: Code snippet for SpectrumCop. 


# Find occupancy in desired region
bound = [lat, lng, radius]
options = [lat, lng, radius, min_power_to_detect]
occupancy_list = APIServer.GetOccupancy(bound,
    start_frequency, end_frequency, min_power_to_detect)

# Get power spectrum for transmitter frequency
locs = []
results = {}
devices = []
for occupancy in occupancy_list:
    if occupancy['Occupied'] == 1:
        new_f_start = occupancy['Frequency'] - 250e3
        new_f_end = occupancy['Frequency'] + 250e3
        devices = APIServer.GetDevices(bound, None)
        for device in devices:
            locs.append([device['Latitude'],
                         device['Longitude']])
            results[device['ID']] = APIServer.GetPowerSpectrum(
                device['ID'], new_f_start, new_f_end,
                options)  # Actual call in new thread.
        break  # only the first occupied frequency is examined

# Localize transmitter based on power measurements
# (iterate over devices so that powers stay aligned with locs)
powers = []
for device in devices:
    powers.append(max(results[device['ID']]))
print(APIServer.LocalizeTransmitter(bound, locs,
    powers, 'LDPL', [P, 3.0]))


Evaluation: We used this application to detect and lo- 
calize a microphone in a region of 75 meters radius in 
IISc. The setup consisted of 3 spectrum analyzers that 
were placed near 3 corners of the region of interest. The 
microphone transmits in a 250 KHz narrow band and the 
frequency range of the search space is set to 3 MHz. The 
SpectrumCop application detected the microphone per- 
fectly and localized it to within 20 meters of the actual 
location. The entire process of detecting and locating the 
microphone took 165 seconds. 


8 Limitations


First, spectrum analyzers are expensive equipment that 
researchers have procured for specific needs. It may not 
be easy to convince owners to volunteer this resource to 
the community, especially during the bootstrapping stage 
where the benefit of the platform is not clear to the owner. 
To date, we have approached a few of our acquaintances 
and have observed mixed results. In the long run, per- 
haps governments may be willing to sponsor a set of 
spectrum analyzers dedicated for SpecNet use. 

Second, spectrum analyzers are typically used inside
labs that may be in basements or deep inside buildings.
Our measurements indicate that buildings can add 5-20
dB of attenuation (20 dB in the basement for FM/TV
transmissions), which restricts the detection range of the
analyzer. If the owner can be convinced to mount the
antenna near a window, the utility of the spectrum analyzer
can be significantly increased. To minimize variability
due to antenna placements, SpecNet can choose to only
include spectrum analyzers with unobstructed antennas.

Figure 12: Measurement locations


Finally, we have not considered the privacy/security 
implications of allowing remote scanning of the spec- 
trum. For now, SpecNet only exposes the power values 
measured from the spectrum scan. Thus, it prevents di- 
rect security and privacy threats such as fine-grained traf- 
fic monitoring or user tracking. Advanced spectrum ana- 
lyzers can provide time domain (I/Q) samples of the scan 
and support for these features in SpecNet would require 
sophisticated controls for privacy and security. 


9 Conclusion 


After the FCC ruling in the U.S. allowing opportunistic 
access to portions of licensed frequency bands, there has 
been tremendous interest in both academia and indus- 
try in developing novel wireless techniques and products 
that take advantage of the new rules. A key requirement 
for enabling this new ecosystem is a measurement infras- 
tructure that can provide real data. SpecNet fulfills this 
critical need by enabling geographically distributed spec- 
trum analyzers to be networked, thereby allowing both 
real-time remote measurements as well as collection of 
historic spectrum usage data. Furthermore, SpecNet ex- 
poses an API that allows users to build interesting dis- 
tributed sensing applications like SpectrumCop with rel- 
ative ease. There is still a lot of work left to achieve our 
goal of building a planet-scale networked spectrum an- 
alyzer testbed, but we believe SpecNet provides a good 
base to build upon. 
Figure 13: TV Tower Localization


Figure 14: Interpolation results 
Abstract


A highly accurate client-independent geolocation service 
stands to be an important goal for the Internet. Despite an 
extensive research effort and significant advances in this 
area, this goal has not yet been met. Motivated by the fact 
that the best results to date are achieved by utilizing ad- 
ditional ’hints’ beyond inherently inaccurate delay-based 
measurements, we propose a novel geolocation method 
that fundamentally escalates the use of external informa- 
tion. In particular, many entities (e.g., businesses, uni- 
versities, institutions) host their Web services locally and 
provide their actual geographical location on their Web- 
sites. We demonstrate that the information provided in 
this way, when combined with network measurements, 
represents a precious geolocation resource. Our method- 
ology automatically extracts, verifies, utilizes, and op- 
portunistically inflates such Web-based information to 
achieve high accuracy. Moreover, it overcomes many of 
the fundamental inaccuracies encountered in the use of 
absolute delay measurements. We demonstrate that our 
system can geolocate IP addresses 50 times more accu- 
rately than the best previous system, i.e., it achieves a 
median error distance of 690 meters on the correspond- 
ing data set. 


1 Introduction 


Determining the geographic location of an Internet host 
is valuable for a number of Internet applications. For ex- 
ample, it simplifies network management in large-scale 
systems, helps network diagnoses, and enables location- 
based advertising services [17,24]. While coarse-grained 
geolocation, e.g., at the state- or city-level, is sufficient in 
a number of contexts [19], the need for a highly accurate 
and reliable geolocation service has been identified as an 
important goal for the Internet (e.g., [17]). Such a sys- 
tem would not only improve the performance of existing 
applications, but would enable the development of novel 
ones. 


Daniel Burgener 
Northwestern University 


Marcel Flores 
Northwestern University 


Cheng Huang 
Microsoft Research 


While client-assisted systems capable of providing 
highly accurate IP geolocation inferences do exist [3,5, 
9], many applications such as location-based access re- 
strictions, context-aware security, and online advertising, 
can not rely on clients’ support for geolocation. Hence, 
a highly accurate client-independent geolocation system 
stands to be an important goal for the Internet. 


An example of an application that already extensively 
uses geolocation services, and would significantly ben- 
efit from a more accurate system, is online advertising. 
For example, knowing that a Web user is from New York 
is certainly useful, yet knowing the exact part of Man- 
hattan where the user resides enables far more effective 
advertising, e.g., of neighboring businesses. On the other 
side of the application spectrum, example services that 
would benefit from a highly accurate and dependable ge- 
olocation system, are the enforcement of location-based 
access restrictions and context-aware security [2]. Also 
of rising importance is cloud computing. In particular, 
in order to concurrently use public and private cloud im- 
plementations to increase scalability, availability, or en- 
ergy efficiency (e.g., [22]), a highly accurate geolocation 
system can help select a properly dispersed set of client- 
hosted nodes within a cloud. 


Despite a decade of effort invested by the network- 
ing research community in this area, e.g., [12, 15-19], 
and despite significant improvements achieved in recent 
years (e.g., [17, 24]), the desired goal, a geolocation 
service that would actually enable the above applica- 
tions, has not yet been met. On one hand, commercial 
databases currently provide rough and incomplete loca- 
tion information [17,21]. On the other hand, the best 
result reported by the research community (to the best 
of our knowledge) was made by the Octant system [24]. 
This system was able to achieve a median estimation er- 
ror of 22 miles (35 kilometers). While this is an ad- 
mirable result, as we elaborate below, it is still insuffi- 
cient for the above applications. 


The key contribution of our paper lies in designing a 
novel client-independent geolocation methodology and 
in deploying a system capable of achieving highly accurate
results. In particular, we demonstrate that our system
can geolocate IP addresses with a median error distance 
of 690 meters in an academic environment. Comparing 
to recent results on the same dataset shows that we im- 
prove the median accuracy by 50 times relative to [24] 
and by approximately 100 times relative to [17]. Im- 
provements at the tail of the distribution are even more 
significant. 

Our methodology is based on the following two in- 
sights. First, many entities host their Web services lo- 
cally. Moreover, such Websites often provide the actual 
geographical location of the entity (e.g., business and 
university) in the form of a postal address. We demon- 
strate that the information provided in this way repre- 
sents a precious resource, i.e., it provides access to a 
large number of highly accurate landmarks that we can 
exploit to achieve equally accurate geolocation results. 
We thus develop a methodology that effectively mines, 
verifies, and utilizes such information from the Web. 

Second, while we utilize absolute network delay mea- 
surements to estimate the coarse-grained area where an 
IP is located, we argue that absolute network delay mea- 
surements are fundamentally limited in their ability to 
achieve fine-grained geolocation results. This is true in 
general even when additional information, e.g., network 
topology [17] or negative constraints such as uninhabit- 
able areas [24], is used. One of our key findings, how- 
ever, is that relative network delays still heavily correlate 
with geographical distances. We thus fully abandon the 
use of absolute network delays in the final step of our ap- 
proach, and show that a simple method that utilizes only 
relative network distances achieves the desired accuracy. 

Combining these two insights into a single methodology,
we design a three-tier system. The first tier operates at
a large, coarse-grained scale, where we utilize a
distance constraint-based method to geolocate a target IP
into an area. At the second tier, we effectively utilize a
large number of Web-based landmarks to geolocate the 
target IP into a much smaller area. At the third tier, we 
opportunistically inflate the number of Web landmarks 
and demonstrate that a simple, yet powerful, closest node 
selection method brings remarkably accurate results. 

We extensively evaluate our approach on three distinct
datasets (PlanetLab, residential, and an online maps
dataset), which enables us to understand how our approach
performs on an academic network, a residential
network, and in the wild. We demonstrate that our
algorithm functions well in all three environments, and that it
is able to locate IP addresses in the real world with high 
accuracy. The median error distances for the three sets 
are 0.69 km, 2.25 km, and 2.11 km, respectively. 

We demonstrate that the factors that influence our
system's accuracy are: (i) Landmark density, i.e., the more
landmarks there are in the vicinity of the target, the
better the accuracy we achieve. (ii) Population density,
i.e., the more people live in the vicinity of the target,
the higher the probability that we obtain more landmarks,
and the better the accuracy we achieve. (iii) Access
technology, i.e., our system has slightly reduced accuracy
(by approximately 700 meters) for cable users relative to
DSL users. While our methodology effectively resolves the
last-mile delay inflation problem, it is necessarily less
resilient to the high last-mile latency variance common in
cable networks.

Given that our approach utilizes Web-based landmark 
discovery and network measurements on the fly, one 
might expect that the measurement overhead (crawling in 
particular) hinders its ability to operate in real time. We 
show that this is not the case. In a fully operational net- 
work measurement scenario, all the measurements could 
be done within 1-2 seconds. Indeed, Web-based land- 
marks are stable, reliable, and long lasting resources. 
Once discovered and recorded, they can be reused for 
many measurements and re-verified over longer time 
scales. 


2 A Three-Tier Methodology 


Our overall methodology consists of two major compo- 
nents. The first part is a three-tier active measurement 
methodology. The second part is a methodology for 
extracting and verifying accurate Web-based landmarks. 
The geolocation accuracy of the first part fundamentally 
depends on the second. For clarity of presentation, in this 
section we present the three-tier methodology by simply 
assuming the existence of Web-based landmarks. In the 
next section, we provide details about the extraction and 
verification of such landmarks. 

We deploy the three-tier methodology using a dis- 
tributed infrastructure. Motivated by the observation that 
the sparse placement of probing vantage points can avoid 
gathering redundant data [26], we collect 163 publicly 
available ping and 136 traceroute servers geographically 
dispersed at major cities and universities in the US. 


2.1 Tier 1


Our final goal is to achieve a high level of geolocation 
precision. We achieve this goal gradually, in three steps, 
by incrementally increasing the precision in each step. 
The goal of the first step is to determine a coarse-grained 
region where the targeted IP is located. To avoid
reinventing the wheel, we use a variant of a well-es-
tablished constraint-based geolocation (CBG) method
[15], with minor modifications.

To geolocate the region of an IP address, we first send 
probes to the target from the ping servers, and convert 
the delay between each ping server and the target into a 
geographical distance. Prior work has shown that pack- 
ets travel in fiber optic cables at 2/3 the speed of light 
in a vacuum (denoted by c) [20]. However, others have 
Figure 1: An example of intersection created by distance
constraints 


demonstrated that 2/3 c is a loose upper bound in practice
due to transmission delay, queuing delay, etc. [15, 17].
Based on this observation, we adopt 4/9 c from [17] as
the conversion factor between measured delay and geo-
graphical distance. We also demonstrate in Section 4
that, by using this conversion factor, we are always able
to obtain a viable area covering the targeted IP.
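
The delay-to-distance conversion can be illustrated with a minimal sketch. It assumes the measured delay is a round-trip time in milliseconds and that the one-way delay is half of it; the text fixes neither detail, and the name `delay_to_distance_km` is hypothetical.

```python
# Speed of light in a vacuum, expressed in km per millisecond.
C_KM_PER_MS = 299_792.458 / 1000.0


def delay_to_distance_km(rtt_ms: float, factor: float = 4.0 / 9.0) -> float:
    """Convert a measured delay into an upper bound on geographical distance.

    Assumptions (not stated in the text): `rtt_ms` is a round-trip time
    and the one-way delay is RTT / 2. `factor` is the fraction of c at
    which packets are assumed to travel (4/9 here, 2/3 for raw fiber).
    """
    if rtt_ms < 0:
        raise ValueError("delay must be non-negative")
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * factor * C_KM_PER_MS
```

Under these assumptions, a smaller factor yields a smaller radius for the same delay, which is why 4/9 c gives tighter rings than 2/3 c.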

Once we establish the distance from each vantage 
point, i.e., ping server, to the target, we use multilater- 
ation to build an intersection that covers the target using 
known locations of these servers. In particular, for each 
vantage point, we draw a ring centered at the vantage 
point, with a radius of the measured distance between 
the vantage point and the target. As we show in Section 
4, this approach indeed allows us to always find a region 
that covers the targeted IP. 

Figure 1 illustrates an example. It geolocates a col-
lected target (we describe how we collect targets
in the wild in Section 4.1.2) whose IP address
is 38.100.25.196 and whose postal address is ‘1850, K
Street NW, Washington DC, DC, 20006’. We draw rings
centered at the locations of our vantage points. The ra- 
dius of each ring is determined by the measured distance 
between the vantage point (the center of this ring) and the 
target. Finally, we geolocate this IP in an area indicated 
by the shaded region, which covers the target, as shown 
in Figure 1. 

Thus, by applying the CBG approach, we manage to 
geolocate a region where the targeted IP resides. Ac- 
cording to [17,24], CBG achieves a median error be- 
tween 143km and 228km distance to the target. Since 
we strive for a much higher accuracy, this is only the 
starting point for our approach. To that end, we depart 
from pure delay measurements and turn to the use of ex- 
ternal information available on the Web. Our next goal is 
to further determine a subset of ZIP Codes, i.e., smaller 
regions that belong to the bigger region found via the 





Figure 2: An example of measuring the delay between
landmark and target (legend: vantage point, target, landmark, router)


CBG approach. Once we find the set of ZIP Codes, we 
will search for additional websites served within them. 
Our goal is to extract and verify the location information 
about these locally-hosted Web services. In this way, we 
obtain a number of accurate Web-based landmarks that 
we will use in Tiers 2 and 3 to achieve high geolocation 
accuracy. 


To find a subset of ZIP Codes that belong to the given 
region, we proceed as follows. We first determine the 
center of the intersection area. Then, we draw a ring 
centered in the intersection center with a diameter of 5 
km. Next, we sample 10 latitude and longitude pairs at 
the perimeter of this ring, by rotating by 36 degrees be- 
tween each point. For the 10 initial points, we verify that 
they belong to the intersection area as follows. Denote
by $U$ the set of latitude and longitude pairs to be verified.
Next, denote by $V$ the set of all vantage points, i.e., ping
servers, with known location. Each vantage point $v_i$ is
associated with the measured distance between itself and
the target, denoted by $r_i$. We wish to find all $u \in U$ that
satisfy


$$\mathrm{distance}(u, v_i) < r_i \quad \text{for all } v_i \in V$$


The distance function here is the great-circle distance 
[23], which takes into account the earth’s sphericity and 
is the shortest distance between any two points on the 
surface of the earth measured along a path on the surface 
of the earth. We repeat this procedure by further obtain- 
ing 10 additional points by increasing the distance from 
the intersection center by 5 km in each round (i.e., to 10 
km in the second round, 15 km in the third etc.). The 
procedure stops when not a single point in a round be- 
longs to the intersection. In this way, we obtain a sample 
of points from the intersection, which we convert to ZIP 
Codes using a publicly available service [4]. Thus, with 
the set of ZIP Codes belonging to the intersection, we 
proceed to Tier 2. 
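
The sampling procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the first ring is placed 5 km from the center, a spherical Earth of radius 6371 km is used, and all function names are hypothetical. The conversion of points to ZIP Codes is omitted because it relies on an external service.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(p, q):
    """Great-circle (haversine) distance between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def destination(origin, bearing_deg, dist_km):
    """Point reached from `origin` after `dist_km` along `bearing_deg`."""
    lat1, lon1 = math.radians(origin[0]), math.radians(origin[1])
    brg = math.radians(bearing_deg)
    d = dist_km / EARTH_RADIUS_KM
    lat2 = math.asin(math.sin(lat1) * math.cos(d)
                     + math.cos(lat1) * math.sin(d) * math.cos(brg))
    lon2 = lon1 + math.atan2(math.sin(brg) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return (math.degrees(lat2), math.degrees(lon2))


def in_intersection(u, constraints):
    """True if u satisfies distance(u, v_i) < r_i for every (v_i, r_i)."""
    return all(great_circle_km(u, v) < r for v, r in constraints)


def sample_intersection(center, constraints, step_km=5.0, angle_deg=36.0,
                        max_rounds=10_000):
    """Sample rings of growing radius until a whole ring falls outside."""
    samples = []
    points_per_ring = int(round(360.0 / angle_deg))
    for k in range(1, max_rounds + 1):
        ring = [destination(center, i * angle_deg, k * step_km)
                for i in range(points_per_ring)]
        inside = [u for u in ring if in_intersection(u, constraints)]
        if not inside:  # stopping rule: no point of this round is inside
            break
        samples.extend(inside)
    return samples
```

With the defaults this produces 10 points per round at 5 km steps; calling it with `step_km=1.0` and `angle_deg=10.0` gives the finer 36-point rounds used later in Tier 3.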


2.2 Tier 2


Here, we attempt to further reduce the possible region 
where the targeted IP is located. To that end, we aim to 
find Web-based landmarks that can help us achieve this 
goal. We explain the methodology for obtaining such 
landmarks in Section 3. Although these landmarks are 
passive, i.e., we cannot actively send probes to other In- 
ternet hosts using them, we use the traceroute program to 
indirectly estimate the delay between landmarks and the 
target. 

Learning from [11] that the more traceroute servers 
we use, the more direct a path between a landmark and 
the target we can find, we first send traceroute probes to 
the landmark (the empty circle in Figure 2) and the tar- 
get (the triangle in Figure 2) from all traceroute servers 
(the solid squares $V_1$ and $V_2$ in Figure 2). For each van-
tage point, we then find the closest common router to
the target and the landmark, shown as $R_1$ and $R_2$ in
Figure 2, on the routes towards both the landmark and
the target. Next, we calculate the latency between the
common router and the landmark ($D_1$ and $D_3$ in Fig-
ure 2) and the latency between the common router and
the target ($D_2$ and $D_4$ in Figure 2). We finally select
the sum ($D$) of the two latencies as the delay between land-
mark and target. In the example above, from $V_1$’s point
of view, the delay $D$ between the target and the landmark
is $D = D_1 + D_2$, while from $V_2$’s perspective, the delay
is $D = D_3 + D_4$.

Since different traceroute servers have different routes 
to the destination, the common routers are not necessar- 
ily the same for all traceroute servers. Thus, each van- 
tage point (a traceroute server) can estimate a different 
delay between a Web-based landmark and the target. In 
this situation, we choose the minimum delay from all 
traceroute servers’ measurements as the final estimation 
of the latency between the landmark and the target. In 
Figure 2, since the path between landmark and target
from $V_1$’s perspective is more direct than that from $V_2$’s
($D_1 + D_2 < D_3 + D_4$), we will consider the sum of $D_1$
and $D_2$ ($D_1 + D_2$) as the final estimation.
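
The estimation described above can be sketched as follows. The sketch assumes that each traceroute is available as an ordered list of `(hop_address, delay)` pairs ending at the destination, and that the latency between the common router and an endpoint is obtained by subtracting the router's delay from the endpoint's delay on the same path. The text does not spell out this subtraction, and the function names are hypothetical.

```python
def estimate_delay(path_to_landmark, path_to_target):
    """Return (common_router, D) for one vantage point, or None.

    Each path is an ordered list of (hop_address, delay) pairs whose last
    entry is the destination itself (landmark or target).
    """
    delay_on_target_path = {hop: d for hop, d in path_to_target[:-1]}
    common = None
    for hop, d in path_to_landmark[:-1]:
        if hop in delay_on_target_path:
            # Keep overwriting: the last shared hop is the one closest
            # to both the landmark and the target.
            common = (hop, d, delay_on_target_path[hop])
    if common is None:
        return None
    router, d_router_l, d_router_t = common
    to_landmark = max(path_to_landmark[-1][1] - d_router_l, 0.0)
    to_target = max(path_to_target[-1][1] - d_router_t, 0.0)
    return router, to_landmark + to_target


def min_delay(measurements):
    """Pick the smallest estimate over all vantage points.

    `measurements` is a list of (path_to_landmark, path_to_target) pairs,
    one per traceroute server.
    """
    estimates = [estimate_delay(l, t) for l, t in measurements]
    estimates = [e for e in estimates if e is not None]
    return min(estimates, key=lambda e: e[1]) if estimates else None
```

Taking the minimum mirrors the choice of the most direct path: a vantage point whose common router sits far from both endpoints can only add delay.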

Routers in the Internet may postpone responses. Con- 
sequently, if the delay on the common router is inflated, 
we may underestimate the delay between landmark and 
target. To examine the ‘quality’ of the common router we
use, we first traceroute different landmarks we collected 
previously and record the paths between any two land- 
marks, which also branch at that router. We then calcu- 
late the great circle distance [23] between two landmarks 
and compare it with their measured distance. If we ob- 
serve that the measured distance is smaller than the cal- 
culated great circle distance for any pair of landmarks, 
we label this router as ‘inflating’, record this informa-
tion, and do not consider its path (and the corresponding
delay) for this or any other measurement. 
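
A minimal sketch of this router check follows. It assumes the landmark-pair observations have already been collected per router as `(measured_km, great_circle_km)` tuples; the data layout and the name `inflating_routers` are hypothetical.

```python
def inflating_routers(observations):
    """Return the set of routers to exclude from delay estimation.

    `observations` maps a router address to a list of
    (measured_km, great_circle_km) tuples, one per pair of landmarks
    whose paths branch at that router. A router is labeled inflating if
    any measured distance is smaller than the true great-circle distance.
    """
    return {
        router
        for router, pairs in observations.items()
        if any(measured < actual for measured, actual in pairs)
    }
```

A measured distance below the great-circle distance is physically impossible for an honest upper bound, so a single violating pair is enough to discard the router.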

Through this process, we can guarantee that the esti- 
Figure 3: An example of shrinking the intersection


mated delay between a landmark and the target is not un- 
derestimated. Nonetheless, such estimated delay, while 
converging towards the real latency between the two en- 
tities, is still usually larger. Hence, it can be considered 
as the upper bound of the actual latency. Using multilat- 
eration with the upper bound of the distance constraints, 
we further reduce the feasible region using the new tier 2 
and the old tier 1 constraints. 

Figure 3 shows the zoomed-in subset of the con-
strained region together with old tier 1 constraints,
marked by thick lines, and new tier 2 constraints, marked 
by thin lines. The figure shows a subset of sampled land- 
marks, marked by the solid dots, and the IP that we aim 
to geolocate, marked by a triangle. The tier 1 constrained 
area contains 257 distinct ZIP Codes, in which we are
able to locate and verify 930 Web-based landmarks. In 
the figure, we show only a subset of 161 landmarks for 
a clearer presentation. Some sampled landmarks lie out- 
side the original tier 1 level intersection. This happens 
because the sampled ZIP Codes that we discover at the 
borders of the original intersection area typically spread 
outside the intersection as well. Finally, the figure shows 
that the tier 2 constrained area is approximately one order 
of magnitude smaller than the original tier 1 area. 


2.3 Tier 3


In this final step, our goal is to complete our geoloca- 
tion of the targeted IP address. We start from the region 
constrained in Tier 2, and aim to find all ZIP Codes in 
this region. To this end, we repeat the sampling proce-
dure described in Section 2.1, this time from the center of
the Tier 2 constrained intersection area and at a finer
granularity. In particular, we extend the radius
by 1 km in each step, and apply a rotation angle of 10
degrees. Thus, we obtain 36 points in each round. We
apply the same stopping criterion, i.e., when no points in
a round belong to the intersection. This finer-grain sam- 
Figure 4: An example of associating a landmark with the
target as the result 


pling process enables us to discover all ZIP Codes in the 
intersection area. For ZIP Codes that were not found in 
the previous step, we repeat the landmark discovery pro- 
cess (Section 3). Moreover, to obtain the distance es- 
timations between newly discovered landmarks and the 
target, we apply the active probing traceroute process ex- 
plained above. 

Finally, knowing the locations of all Web-based land- 
marks and their estimated distances to the target, we se- 
lect the landmark with the minimum distance to the tar- 
get, and associate the target’s location with it. While this 
approach may appear ad hoc, it signifies one of the key 
contributions of our paper. We find that at smaller
scales, relative distances are preserved by delay measure-
ments, overcoming many of the fundamental inaccuracies
encountered in the use of absolute measurements. For
example, a delay of several milliseconds, commonly seen 
at the last mile, could place an estimate of a scheme that 
relies on absolute delay measurements hundreds of kilo- 
meters away from the target. On the contrary, select- 
ing the closest node in an area densely populated with 
landmarks achieves remarkably accurate estimates, as we 
show below in our example case, and demonstrate sys- 
tematically in Section 4 via large-scale analysis. 
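
The final association step amounts to a minimum search over the landmarks. The sketch below assumes each landmark already carries its verified coordinates and its traceroute-based distance estimate to the target; the `Landmark` type and the `geolocate` name are hypothetical.

```python
from typing import NamedTuple


class Landmark(NamedTuple):
    name: str
    lat: float
    lon: float
    measured_km: float  # estimated (upper-bound) distance to the target


def geolocate(landmarks):
    """Associate the target with the landmark of minimum measured distance."""
    if not landmarks:
        raise ValueError("at least one landmark is required")
    best = min(landmarks, key=lambda lm: lm.measured_km)
    return (best.lat, best.lon), best
```

Only the ordering of the measured distances matters here, not their absolute values, which is why inflated estimates can still select a nearby landmark.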

Figure 4 shows the striking accuracy of this approach. 
We manage to associate the targeted IP location with a 
landmark which is ’across the street’, i.e., only 0.103 km 
distant from the target. We analyze this result in more 
detail below. Here, we provide the general statistics for 
the Tier 3 geolocation process. In this last step, we dis- 
cover 26 additional ZIP Codes and 203 additional land- 
marks in the smaller Tier 2 intersection area. We then 
associate the landmark, which is at ‘1776 K Street North-
west, Washington, DC’ and has a measured distance of
10.6 km, yet a real geographical distance of 0.103 km,
with the target. To clearly show the association, Figure 4
zooms in to a much finer-grained street level, in which the
constrained rings and relatively more distant landmarks
are not shown.


Figure 5: Measured distance vs. geographical distance.


2.3.1 The Power of Relative Network Distance 


Here, we explore how the relative network distance ap- 
proach achieves such good results. Figure 5 sheds more 
light on this phenomenon. We examine the 13 landmarks 
within 0.6 km of the target shown in Figure 4. For each 
landmark, we plot the distance between the target and the 
Web-based landmarks (y-axis) (measured via the tracer- 
oute approach) as a function of the actual geographical 
distance between the landmarks and the target (x-axis). 
The first insight from the figure is that there is indeed 
a significant difference between the measured distances, i.e.,
the upper bounds, and the real distances. This is not
a surprise. A path between a landmark, over the com- 
mon router, to the destination (Figure 2) can often be cir- 
cuitous and inflated by queuing and processing delays, 
as demonstrated in [17]. Hence, the estimated distance 
dramatically exceeds the real distance, by approximately 
three orders of magnitude in this case. 


However, Figure 5 shows that the distance estimated 
via network measurements (y-axis) is largely in propor- 
tion with the actual geographical distance. Thus, de- 
spite the fact that the direct relationship between the 
real geographic distance and estimated distance is in- 
evitably lost in inflated network delay measurements, 
the relative distance is largely preserved. This is be-
cause the network paths that are used to estimate the
distance between landmarks and the target largely share
the same links, and hence experience similar transmission-
and queuing-delay properties. Thus, selecting the land-
mark with the smallest delay is an effective approach,
as we also demonstrate later in the text.


3 Extracting and Verifying Web-Based
Landmarks 


Many entities, e.g., companies, academic institutions, 
and government offices, host their Web services locally. 
One implication of this setup is that the actual geographic 
addresses (in the form of a street address, city, and ZIP
Code), which are typically available at companies’ and 
universities’ home Web pages, correspond to the actual 
physical locations where these services are located. Ac- 
cordingly, the geographical location of the correspond- 
ing web-servers’ IP addresses becomes available, and 
the servers themselves become viable geolocation land- 
marks. Indeed, we have demonstrated above that such 
Web-based landmarks constitute an important geoloca- 
tion resource. In this section, we provide a compre- 
hensive methodology to automatically extract and verify 
such landmarks. 


3.1 Extracting Landmarks 


To automatically extract landmarks, we mine numerous 
publicly available mapping services. In this way, we are 
able to associate an entity’s postal address with its do- 
main name using such mapping services. Note that the 
use of online mapping services is a convenience, not a 
requirement for our approach. Indeed, the key resource 
that our approach relies upon is the existence of geo-
graphical addresses at locally hosted websites, which can
be read directly from those websites.


In order to discover landmarks in a given ZIP Code, 
which is an important primitive of our methodology ex- 
plained in Section 2 above, we proceed as follows. We 
first query the mapping service with a request that consists
of the desired ZIP Code and a keyword, i.e., one of ‘business’,
‘university’, or ‘government office’. The service replies
with a list of companies, academic institutions, or gov- 
ernment offices within, or close to, this ZIP Code. Each 
landmark in the list includes the geographical location of 
this entity at the street-level precision and its web site’s 
domain name. 


As an example, a jewelry company at ‘55 West 47th
Street, Manhattan, New York, NY, 10036’, with the do-
main name www.zaktools.com, is a landmark for the ZIP
Code 10036. For each entity, we also convert its domain 
name into an IP address to form a (domain name, IP ad- 
dress, and postal address) mapping. For the example 
above, the mapping in this case is (www.zaktools.com, 
69.33.128.114, ’55 West 47th Street, Manhattan, New 
York, NY, 10036’). A domain name can be mapped into 
several IP addresses. Initially, we map each of the IP 
addresses to the same domain name and postal address. 
Then, we verify all the extracted IP addresses using the 
methodology we present below. 
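
The mapping step can be sketched as below. The default resolver uses the standard-library call `socket.gethostbyname_ex`, which returns every IPv4 address for a name; the injectable `resolver` parameter and the function names are assumptions added for illustration and testing.

```python
import socket


def default_resolver(domain):
    """Return all IPv4 addresses that the domain name resolves to."""
    _hostname, _aliases, addresses = socket.gethostbyname_ex(domain)
    return addresses


def build_mappings(domain, postal_address, resolver=default_resolver):
    """Form one (domain name, IP address, postal address) tuple per address.

    Every resolved IP initially inherits the same postal address; the
    tuples are candidates only and still need verification.
    """
    return [(domain, ip, postal_address) for ip in resolver(domain)]
```

Passing a stub resolver keeps the example independent of live DNS, for instance `build_mappings("www.example.org", "1 Main St", resolver=lambda d: ["192.0.2.1"])`.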


3.2 Verifying Landmarks


A geographic address extracted from a Web page using 
the above approach may not correspond to the associated 
server’s physical address for several reasons. Below, we 
explain such scenarios and propose verification methods 
to automatically detect and remove such landmarks. 


3.2.1 Address Verification 


The businesses and universities returned by online map-
ping services may be landmarks near the area cov-
ered by the ZIP Code, not necessarily within the ZIP
Code. Thus, we first examine the ZIP Code in the postal
address of each landmark. If a landmark has a ZIP Code 
different from the one we searched for, we remove it 
from the list of candidate landmarks. For example, for 
the ZIP Code 10036, a financial services company called 
Credit Suisse (www.credit-suisse.com) at ’11 Madison 
Ave, New York, NY, 10010’ is returned by online map- 
ping services as an entity near the specified ZIP Code 
10036. Using our verification procedure, we remove 
such a landmark from the list of landmarks associated 
with the 10036 ZIP Code. 
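
A minimal sketch of this filter follows. It assumes candidates arrive as `(domain, postal_address)` pairs and that the ZIP Code is the last five-digit group in the address string; both the data layout and the function names are hypothetical.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")


def zip_of(postal_address):
    """Return the last five-digit ZIP Code found in the address, or None."""
    matches = ZIP_RE.findall(postal_address)
    return matches[-1] if matches else None


def verify_zip(candidates, searched_zip):
    """Keep only candidates whose postal address lies in the searched ZIP Code."""
    return [c for c in candidates if zip_of(c[1]) == searched_zip]
```

Taking the last match rather than the first avoids mistaking a five-digit street number for the ZIP Code.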


3.2.2 Shared Hosting and CDN Verification 


Additionally, a company may not always host its website 
locally. It may either utilize a CDN to distribute
its content or use shared hosting techniques to store its
archives. In both situations, there is no one-to-one map-
ping between an IP address and a postal address.
In particular,
a CDN server may serve multiple companies’ websites 
with distinct postal addresses. Likewise, in the shared 
hosting case a single IP address can be used by hundreds 
or thousands of domain names with diverse postal ad- 
dresses. Therefore, for a landmark with such character- 
istics, we should certainly not associate its geographical 
location with its domain name, and in turn its IP address. 
On the contrary, if an IP address is solely used by a sin- 
gle entity, the postal address is much more trustworthy. 
While not necessarily comprehensive, we demonstrate 
that this method is quite effective, yet additional verifi- 
cations are needed, as we explain in Section 3.2.3 below. 

In order to eliminate a bad landmark, we access its
website using (i) its domain name and (ii) its IP address
independently. If the contents, or heads (distinguished
by <head> and </head>), or titles (distinguished by
<title> and </title>) returned by the two methods are
the same, we confirm that this IP address belongs to a
single entity. One complication is that if the first request
does not hit the ‘final’ content, but a redirection, we will
extract the ‘real’ URL and send an additional request to
fetch the ‘final’ content.
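
The comparison can be sketched as a pure function over two already-fetched HTML documents. Fetching the pages and following redirections are deliberately left out, the regular-expression extraction is a simplification of real HTML parsing, and the function names are hypothetical.

```python
import re


def _section(html, tag):
    """Return the text between <tag ...> and </tag>, or None if absent."""
    match = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def same_entity(html_by_domain, html_by_ip):
    """True if the two responses agree on content, head, or title."""
    if html_by_domain == html_by_ip:
        return True
    for tag in ("head", "title"):
        a = _section(html_by_domain, tag)
        b = _section(html_by_ip, tag)
        if a is not None and a == b:
            return True
    return False
```

A shared-hosting server that answers a bare-IP request with an error page fails all three comparisons, so the landmark is discarded.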

Take the landmark (www.manhattanmailboxes.com) 
at ‘676A 9th Avenue, New York, NY, 10036’ as an ex-
ample. We end up with a web page showing ‘access
error’ when we access this website via its IP address, 
216.39.57.104. Indeed, searching an online shared host- 
ing check [8], we discover that there are more than 2,000 
websites behind this IP address. 


3.2.3 The Multi-Branch Verification


One final scenario occurs often in the real world: A com- 
pany headquartered in a place where its server is also 
deployed may open a number of branches nationwide. 
Likewise, a medium size organization can also have its 
branch offices deployed locally in its vicinity. Each such 
branch office typically has a different location in a dif- 
ferent ZIP Code. Still, all such entities have the same 
domain name and associated IP addresses as their head- 
quarters. 

As we explained in Section 2, we retrieve landmarks in 
a region covering a number of ZIP Codes. If we observe 
that some landmarks, with the same domain name, have 
different locations in different ZIP Codes, we remove 
them all. For example, the Allstate Insurance Company, 
with the domain name ‘www.allstate.com’, has many af-
filiated branch offices nationwide. As a result, it shows
up multiple times for different ZIP Codes in an intersec- 
tion. Using the described method, we manage to elimi- 
nate all such occurrences. 
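
The multi-branch filter can be sketched as a grouping step. The sketch assumes landmarks are `(domain, ip, zip_code)` tuples collected over all ZIP Codes of the region; the layout, the placeholder IP addresses in the usage note, and the name `drop_multi_branch` are hypothetical.

```python
from collections import defaultdict


def drop_multi_branch(landmarks):
    """Remove every landmark whose domain appears in more than one ZIP Code.

    `landmarks` is a list of (domain, ip, zip_code) tuples.
    """
    zips_by_domain = defaultdict(set)
    for domain, _ip, zip_code in landmarks:
        zips_by_domain[domain].add(zip_code)
    return [lm for lm in landmarks if len(zips_by_domain[lm[0]]) == 1]
```

For example, two entries for `www.allstate.com` in ZIP Codes 20006 and 20036 would both be removed, while a domain seen in a single ZIP Code is kept.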


3.3 Resilience to Errors


Applying the above methods, we can remove the vast 
majority of erroneous Web landmarks. However, excep- 
tions certainly exist. One example is an entity (e.g., a 
company) without any branch offices that hosts a web- 
site used exclusively by that company, but does not lo- 
cate its Web server at the physical address available on 
the Website. In this case, binding the IP address with 
the given geographical location is incorrect, hence such 
landmarks may generate errors. Here, we evaluate the 
impact that such errors can have on our method’s accu- 
racy. Counterintuitively, we show that the larger the error 
distance is between the claimed location (the street-level 
address on a website) and the real landmark location, the 
more resilient our method becomes to such errors. In all 
cases, we demonstrate that our method poses significant 
resilience to false landmark location information. 

Figure 6 illustrates four possible cases for the rela- 
tionship between a landmark’s real and claimed location. 
The figure denotes the landmark’s real location by an 
empty circle, the landmark’s claimed location by a solid 
circle, and the target by a triangle. Furthermore, denote 
R1 as the claimed distance, i.e., the distance between the 
claimed location and the target. Finally, denote R2 as 
the measured distance between the landmark’s actual lo- 
cation and the target. 
Figure 6: The effects of improper landmark


Figure 6(a) shows the baseline error-free scenario. In 
this case, the claimed and the real locations are identi- 
cal. Hence, R1 = R2. Thus, we can draw a ring that is
centered at the solid circle and is always able to contain 
the target, since the upper bound is used to measure the 
distance in Section 2.2. 


Figure 6(b) shows the case when the claimed land- 
mark’s location is different from the real location. Still, 
the real landmark is farther away from the target than the 
claimed location is. Hence, R2 > R1. Thus, we will 
draw a bigger ring with the radius of R2, shown as the 
dashed curve, than the normal case with the radius of 
R1. Thus, such an overestimate yields a larger coverage 
that always includes the target. Hence, our algorithm is 
unharmed, since the target remains in the feasible region. 

Figures 6 (c) and (d) show the scenario when the 
real landmark’s location is closer to the target than the 
claimed location is, i.e., R2 < R1. There are two sub-
scenarios here. In the underestimate case (shown in Fig-
ure 6(c)), the real landmark location is slightly closer to
the target and the measured delay is only a little smaller 
than it should be. However, since the upper bound is 
used to measure the delay and convert it into distance, 
such underestimates can be counteracted. Therefore, we 
can still draw a ring with a radius of R2, indicated by 
the dashed curve, covering the target. In this case, the 
underestimate does not hurt the geolocation process. 

Finally, in the excessive underestimate case (shown
in Figure 6(d)), the landmark is actually quite close to the
target and the measured delay is much smaller than ex-
pected. Consequently, we end up with a dashed curve with
the radius of R2 that does not include the target, even 
when the upper bounds are considered. In this case, the 
excessive underestimate leads us to an incorrect intersec-
tion or an improper association between the landmark
and the target (R2 < R1). A technical report [10] provides
a proof that the excessive underestimate case is unlikely
to happen; we omit the proof
here due to space constraints.


4 Evaluation 


4.1 Datasets 


We use three different datasets, Planetlab, residential, 
and online maps, as we explain below. Compared with
the large online maps dataset, the number of targets in
the Planetlab and the residential datasets is relatively
small. However, these two datasets help us gain valuable
insights about the performance of our method in different 
environments, since the online maps dataset can contain 
both types of targets. 


4.1.1 Planetlab dataset 


One method commonly used to evaluate the accuracy of 
IP geolocation systems is to geolocate Planetlab nodes, 
e.g., [17,24]. Since the locations of these nodes are 
known publicly (universities must report the locations of 
their nodes), it is straightforward to compare the location 
given by our system with the location provided by the 
Planetlab database. We select 88 nodes from Planetlab, 
limiting ourselves to at most one node per location. Oth- 
ers (e.g., [17]) have observed errors in the given Planet- 
lab locations. Thus, we manually verify all of the nodes’
locations.


4.1.2 Residential dataset 


Since the Planetlab nodes are all located on aca-
demic networks, we need to validate our approach on
residential networks as well. Indeed, many primary ap- 
plications of IP geolocation target users on residential 
networks. In order to do this, we created a website, 
which we made available to our social networks, widely 
dispersed all over the US. The site automatically records 
users’ IP addresses and enables them to enter their postal 
address and the access provider. In particular, we enable 
six selections for the provider: AT&T, Comcast, Veri- 
zon, other ISPs, University, and Unknown. Moreover, 
we explicitly request that users not enter their postal ad- 
dress if they are accessing this website via proxy, VPN, 
or if they are unsure about their connection. We then dis-
tributed the link to many people via our social networks,
and obtained 231 IP address and location pairs.

Next, we eliminate duplicate IPs and ‘dead’ IPs that were
not accessible over the course of the experiment, which
took place one month after the data was collected. We also elim-
inate a large number of IPs with access method ‘univer-
sity’ or ‘unknown’, since we intend to extract residen-
tial IPs and compare them with academic IPs in Sec-
tion 4.2. After elimination, we are left with 72 IPs.


Figure 7: The distribution of the population density of
three datasets


4.1.3 Online Maps dataset 


We obtained a large-scale query trace from a popular on-
line maps service. This dataset contains three months of
users’ search logs for driving directions.¹ Each record
consists of the user access IP address, local access time 
at user side, user browser agent, and the driving sequence 
represented by two pairs of latitude and longitude points. 
Our hypothesis here is that if we observe a location, as 
either source or destination in the driving sequence, pe- 
riodically associated with an IP address, then this IP ad- 
dress is likely at that location. To extract such association 
from the dataset, we employ a series of strict heuristics 
as follows. 

We first exclude IP addresses associated with multi- 
ple browser agents. This is because it is unclear whether 
this IP address is used by only one user with multiple 
browsers or by different users. We then select IP ad- 
dresses for which a single location appears at least four 
times in each of the three months, since such IP addresses 
with ’stable’ search records are more likely to provide ac- 
curate geolocation information than the ones with only a 
few search records. We further remove IP addresses that 
are associated with two or more locations that appear at 
least four times. Finally we remove all ’dead’ IPs from 
the remaining dataset. 
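
One possible reading of these heuristics is sketched below. It assumes the trace has been flattened to `(ip, month, browser_agent, location)` rows, that a location is "frequent" if it appears at least four times over the whole trace, and that the dead-IP check is performed separately; all of these are interpretive assumptions, and the name `stable_ip_locations` is hypothetical.

```python
from collections import Counter, defaultdict


def stable_ip_locations(records, months=(1, 2, 3), min_hits=4):
    """Map each retained IP to its single stable location.

    `records` is an iterable of (ip, month, browser_agent, location) rows.
    """
    agents = defaultdict(set)
    per_month = defaultdict(Counter)  # ip -> Counter over (location, month)
    total = defaultdict(Counter)      # ip -> Counter over location
    for ip, month, agent, location in records:
        agents[ip].add(agent)
        per_month[ip][(location, month)] += 1
        total[ip][location] += 1

    result = {}
    for ip in agents:
        if len(agents[ip]) > 1:
            continue  # several browser agents: possibly several users
        stable = [loc for loc in total[ip]
                  if all(per_month[ip][(loc, m)] >= min_hits for m in months)]
        frequent = [loc for loc, n in total[ip].items() if n >= min_hits]
        # Require one stable location and no second frequent location.
        if stable and len(frequent) == 1:
            result[ip] = stable[0]
    return result
```

The strictness is intentional: discarding ambiguous IPs trades dataset size for confidence in the ground-truth locations.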


4.1.4 Dataset characteristics 


Here, our goal is to explore the characteristics of the lo- 
cations where the IP addresses of the three datasets are. 
¹ We respect a request of this online map service company and do
not disclose the number of requests and collected IPs here and in the
rest of the paper.


Figure 8: Comparison of error distances of three datasets


In particular, population density is an important param- 
eter that indicates the rural vs. urban nature of the area 
in which an IP address resides. We will demonstrate be- 
low that this parameter influences the performance of our 
method, since urban areas typically have a large number 
of web-based landmarks. 

Figure 7 shows the distribution of the population den-
sity of the ZIP Codes in which the IP addresses of the
three datasets are located. We obtain the population density
for each ZIP Code by querying the website City Data [1]. 
Figure 7 shows that our three datasets cover both rural 
areas, where the population density is small, and urban 
areas, where the population density is large. In particu- 
lar, all three datasets have more than 20% of IPs in ZIP 
Codes whose population density is less than 1,000. The 
figure also shows that the Planetlab dataset is the most ‘ur-
ban’ one, while the Online Maps dataset has the largest
presence in rural areas. In particular, about 18% of IPs
in the Online Maps dataset reside in ZIP Codes whose 
population density is less than 100. 


4.2 Experimental results


4.2.1 Baseline results


Figure 8 shows the results for the three datasets. In par- 
ticular, it depicts the cumulative probability of the error 
distance, i.e., the distance between a target’s real location 
and the one geolocated by our system. Thus, the closer 
the curve is to the upper left corner, the smaller the error 
distance, and the better the results. The median errors for
the three datasets, a measure typically used to represent
the accuracy of geolocation systems [15, 17, 24], are 0.69
km for Planetlab, 2.25 km for the residential dataset, and 
2.11 km for the online maps dataset. Beyond excellent 
median results, the figure shows that the tail of the dis- 
tribution is not particularly long. Indeed, the maximum 
error distances are 5.24 km, 8.1 km, and 13.2 km for 
Figure 9: Landmark density of three datasets


Planetlab, residential, and online maps datasets, respec- 
tively. The figure shows that the performances of the res- 
idential and online maps datasets are very similar. This 
is not a surprise because the online maps dataset is dom- 
inated by residential IPs. On the other hand, our system 
achieves clearly better results in the Planetlab scenario. 
We analyze this phenomenon below. 


4.2.2 Landmark density 


Here, we explore the number of landmarks in the proximity
of targeted IPs. The larger the number of landmarks we can
discover in the vicinity of a target, the larger the
probability that we can accurately geolocate the targeted
IP. We proceed as follows. First, we count the number of
landmarks in circles of radius r, which we increase from 0
to 6 km, as shown in Figure 9. Then, we normalize the
number of landmarks for each radius relative to the total
number of landmarks seen by all three datasets that fit
into the 6 km radius. Because of this normalization, the
normalized values at x = 6 km sum to 1 across the three
datasets. Likewise, due to the normalization, the value on
the y-axis can be considered the landmark density.
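
The normalization described above can be sketched as follows. This is a minimal Python illustration under our own assumptions: each dataset is represented by a pooled list of target-to-landmark distances, and the function name and sample distances are hypothetical.

```python
def landmark_density(distances_by_dataset, radii_km, max_radius_km=6.0):
    """Normalized cumulative landmark counts per dataset.

    distances_by_dataset maps a dataset name to the pooled
    target-to-landmark distances (km) observed for its targets.
    """
    # Total number of landmarks within the maximum radius, over all datasets.
    total = sum(
        sum(1 for d in dists if d <= max_radius_km)
        for dists in distances_by_dataset.values()
    )
    if total == 0:
        raise ValueError("no landmarks within the maximum radius")
    return {
        name: [sum(1 for d in dists if d <= r) / total for r in radii_km]
        for name, dists in distances_by_dataset.items()
    }


# Hypothetical distances, for illustration only.
distances = {
    "planetlab": [0.2, 0.5, 1.0, 2.5, 4.0],
    "residential": [1.5, 3.0, 5.5],
    "online_maps": [2.0, 5.0, 7.5],
}
radii = [1, 2, 3, 4, 5, 6]
density = landmark_density(distances, radii)
print({name: curve[-1] for name, curve in density.items()})
```

By construction, the last values of the three curves (at r = 6 km) add up to 1.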

Figure 9 shows the landmark density for the three
datasets as a function of the radius. The figure shows
that the landmark density is largest in the Planetlab case.
This is expected because one can find a number of
Web-based landmarks on a university campus. This certainly
increases the probability of accurately geolocating IPs in
such an environment, as we demonstrated above. The
figure shows that residential targets experience a lower
landmark density relative to the Planetlab dataset. At the
same time, the online maps dataset shows an even lower
landmark density. As shown in Figure 7, our residential
dataset is more biased towards urban areas. In contrast,
the online maps dataset provides a more comprehensive
and unbiased breakdown of locations. Some of them are
rural areas, where the density of landmarks is naturally
lower. In summary, the landmark density is a factor that
clearly impacts our system’s geolocation accuracy. Still,
additional factors such as access network level properties
do play a role, as we show below.


4.2.3 Global landmark density 


To understand the global landmark density (more
precisely, US-wide landmark density), we evenly sample
18,000 ZIP Codes over all states in the US. Figure 10 shows
that 79.4% of ZIP Codes contain at least one landmark
within the ZIP Code. We manually checked the remaining ZIP
Codes and found that they are typically rural areas, where
local entities, e.g., businesses, are naturally rare.
Nonetheless, for 83.78% of ZIP Codes, we find at least one
landmark within 6 km; for 88.51% of ZIP Codes, we find at
least one landmark within 15 km; finally, for 93.44% of
ZIP Codes, we find at least one landmark within 30 km.
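
A coverage statistic of this kind can be computed as in the following Python sketch. The input representation (one nearest-landmark distance per sampled ZIP Code), the function name, and the sample values are assumptions made for illustration.

```python
def coverage_by_radius(nearest_landmark_km, radii_km):
    """Fraction of ZIP Codes with at least one landmark within each radius.

    nearest_landmark_km holds, per sampled ZIP Code, the distance (km) to
    its nearest landmark, or None if no landmark was found at all.
    """
    n = len(nearest_landmark_km)
    return {
        r: sum(1 for d in nearest_landmark_km if d is not None and d <= r) / n
        for r in radii_km
    }


# Hypothetical sample of five ZIP Codes, for illustration only.
sample = [0.5, 4.0, 10.0, 25.0, None]
print(coverage_by_radius(sample, [6, 15, 30]))
```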

We make the following observation. Figure 10 can be used
to predict the US-wide performance of our method from the
area perspective. For example, it shows that for 6.6% of
the territory, the error is necessarily larger than 30 km.
Note, however, that such areas are extremely sparsely
populated. For example, the average population density in
the 6.6% of ZIP Codes that have no landmark within 30 km
is less than 100. Extrapolating conservatively to the
entire country, such areas account for about 0.92% of the
entire population.


4.2.4 The role of population density 


Here, we return to our datasets and evaluate our system’s
performance, i.e., error distance, as a function of
population density. For the sake of clarity, we merge the
results of the three datasets. Figure 11 plots the best-fit
curve that captures the trends. It shows that the error
distance is smallest in densely populated areas, while the
error grows as the population density decreases. This
result is in line with our analysis in Section 4.2.3.
Indeed, the larger the population density, the higher the
probability that we can discover more landmarks. Likewise,
as shown in Section 4.2.2, the more landmarks we can
discover in the vicinity of a targeted IP address, the
higher the probability that we can accurately geolocate it.
Finally, the results show that our system is still capable
of geolocating IP addresses in rural areas. For example,
we trace the IP that shows the worst error of 13.2 km.
We find that this is an IP in a rural area with no
landmarks discovered within the ZIP Code, which has a
population density of 47. The landmark with the minimum
measured distance, which our system selected, is 13.2 km
away.

Figure 11: Error distance vs. population density


4.2.5 The role of access networks


In contrast to the academic environment, a number of
residential IP addresses access the Internet via DSL or
cable networks. Such networks create the well-known
last-mile delay inflation problem, which represents a
fundamental barrier to methods that rely on absolute delay
measurements. Because our method relies on relative
delay measurements, it is highly resilient to such
problems, as we show below. To evaluate this issue, we
examine and compare our system’s performance for the three
residential network providers whose IPs we collected in
Section 4.1.2: AT&T, Comcast, and Verizon.

Figure 12 shows the CDF of the error distance for the
three providers. The median error distance is 1.48 km
for Verizon, 1.68 km for AT&T, and 2.38 km for Comcast.
Thus, despite the fact that we measure significantly
inflated delays in the last mile, we still manage to
geolocate the endpoints very accurately. For example, a
delay of 5 ms [13] that we commonly see at the last mile
could place a scheme relying on absolute delay measurements
700 km away from the target. Our approach effectively
addresses this problem and geolocates the targets within
a few kilometers.
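
The order of magnitude of this displacement can be checked with a short calculation. The Python sketch below assumes that the 5 ms is treated as additional one-way delay and that delay is converted to distance at 4/9 of the speed of light; both are assumptions of this example rather than statements from the text above.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # c in vacuum, in km per millisecond
CONVERSION_FACTOR = 4 / 9              # assumed effective speed as a fraction of c


def delay_to_distance_km(delay_ms, factor=CONVERSION_FACTOR):
    """Distance that a given one-way delay corresponds to at the assumed speed."""
    return delay_ms * factor * SPEED_OF_LIGHT_KM_PER_MS


extra_km = delay_to_distance_km(5.0)
print(f"{extra_km:.0f} km")
```

Under these assumptions the extra delay corresponds to several hundred kilometers, the same order as the 700 km quoted above.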

Figure 12 shows that our method has reduced performance
for Comcast targets, which show a somewhat longer tail
than those of the other two providers. We explore this
issue in more depth. According to [7], AT&T and Verizon
offer DSL services. Comcast is predominantly a cable
Internet provider, and offers DSL only in a small number
of areas. As demonstrated in [13], cable access networks
have a much larger latency variance than DSL networks,
and their latency may vary rapidly over short time scales.
While our relative delay approach is resilient to absolute
delay inflation at the last mile, it can still be hurt by
measured delay variance. Because latency in cable networks
changes over short time scales, it blurs our measurements,
which are not fully synchronized. Hence, the landmarks’
relative proximity estimation gets blurred, which causes
the effects shown in the figure. In particular, the median
error distance of the cable case increases by
approximately 700 meters relative to the DSL case (shown
by the arrow from AT&T to Comcast in the middle of
Figure 12), while the maximum error distance increases by
2 km (shown by the arrow from AT&T to Comcast at the top
of Figure 12).
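
The distinction between constant delay inflation and delay variance can be illustrated with a toy example. In the Python sketch below, the landmark names and delay values are hypothetical, and selecting the landmark with the smallest measured delay is a simplification of the actual procedure.

```python
def closest_landmark(delays_ms):
    """Pick the landmark with the smallest measured delay."""
    return min(delays_ms, key=delays_ms.get)


# Hypothetical delays (ms) without any last-mile effect.
base = {"A": 3.0, "B": 3.4, "C": 5.1}

# Constant last-mile inflation: every measurement grows by the same 20 ms.
inflated = {name: delay + 20.0 for name, delay in base.items()}

# Varying last-mile delay: the probe to A happens to see 1 ms more than the rest.
jittered = {"A": base["A"] + 21.0, "B": base["B"] + 20.0, "C": base["C"] + 20.0}

print(closest_landmark(base), closest_landmark(inflated), closest_landmark(jittered))
```

A constant offset leaves the ordering of the landmarks unchanged, whereas an offset that differs between probes can reorder landmarks whose delays are close.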


5 Discussion 


Measurement overhead. Our methodology incurs
measurement overhead due to Web crawling and network
probing. Still, it is capable of generating near real-time
responses, as we explain below. To geolocate an IP
address, we crawl Web landmarks for a portion of ZIP
Codes on the fly, as we explained in Sections 2.2 and
2.3. It is important to understand that this is a one-time
overhead per ZIP Code because we cache all landmarks
for every ZIP Code that we visit. Thus, when we want to
geolocate other IP addresses in the vicinity of a previous
one, we reuse previously cached landmarks. Once this
dataset is built, only occasional updates are needed. This
is because the Web-based landmarks we use are highly
stable and long-lived in the common case.
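
The one-time crawling cost per ZIP Code can be pictured with a simple cache. The Python sketch below is only an illustration: the class, the stand-in crawl function, and the ZIP Code used are hypothetical, and cache expiry for the occasional updates is omitted.

```python
class LandmarkCache:
    """Caches the landmarks crawled for each ZIP Code."""

    def __init__(self, crawl_fn):
        self._crawl_fn = crawl_fn  # hypothetical crawler: ZIP Code -> landmarks
        self._by_zip = {}
        self.crawl_count = 0

    def landmarks(self, zip_code):
        # Crawl only on the first request for a ZIP Code; reuse afterwards.
        if zip_code not in self._by_zip:
            self._by_zip[zip_code] = self._crawl_fn(zip_code)
            self.crawl_count += 1
        return self._by_zip[zip_code]


def fake_crawl(zip_code):
    # Stand-in for the Web crawl; returns made-up landmark names.
    return [f"landmark-{zip_code}-{i}" for i in range(3)]


cache = LandmarkCache(fake_crawl)
first = cache.landmarks("60208")
second = cache.landmarks("60208")  # served from the cache, no second crawl
print(cache.crawl_count)
```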


On the network measurement side, we generate concurrent
probes from multiple vantage points. In the first tier, we
need 2 RTTs (1 RTT from the master node to the vantage
points, and 1 RTT for the ping measurements). In each of
the second and third tiers, the geolocation response time
per IP is theoretically bounded by 3 round-trip times
(1 RTT from the master node to the measurement vantage
points, and 2 RTTs for the advanced traceroute overhead²).
Thus, the total overhead on the network measurement side
is 8 RTTs, which typically translates to a delay of 1-2
seconds.
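
The RTT budget can be tallied as in the short Python sketch below. The per-tier counts follow the text; the RTT values of 125 ms and 250 ms are assumptions chosen only to show which round-trip times would yield a delay of 1-2 seconds.

```python
# Round trips per tier, as described in the text.
TIER_RTTS = {
    "tier 1": 1 + 1,  # master node to vantage points, plus ping
    "tier 2": 1 + 2,  # master node to vantage points, plus advanced traceroute
    "tier 3": 1 + 2,  # same as tier 2
}
TOTAL_RTTS = sum(TIER_RTTS.values())


def response_time_s(rtt_ms):
    """Network-side response time in seconds for a given RTT."""
    return TOTAL_RTTS * rtt_ms / 1000.0


for rtt_ms in (125, 250):  # assumed RTTs, for illustration only
    print(rtt_ms, "ms RTT ->", response_time_s(rtt_ms), "s")
```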


Migrating web services to the cloud. Cloud services
are thriving on the Internet. One might be concerned
that this could dramatically reduce the number of
landmarks that we can rely upon. We argue that this is not
the case. While more websites might indeed be served from
the cloud, the total number of websites will certainly
increase over time. Even if a large percentage of websites
end up in the cloud, the remaining websites will still
provide a reliable and accurate backbone for our method.
Moreover, even when an entity migrates a Web site to the
cloud, the associated e-mail exchange servers do remain
hosted locally (results not shown here due to space
constraints). Hence, such servers can serve as accurate
geolocation landmarks. Our key contribution lies in
demonstrating that all such landmarks (i.e., Web, e-mail,
or any other) can be effectively used for accurate
geolocation.


International coverage. Our evaluation is limited to the
US simply because we were able to obtain the vast majority
of the ground-truth information from within the US. Still,
we argue that our approach can be equally used in other
regions. This is because other countries, such as Canada,
the UK, China, India, and South Korea, also have their own
“ZIP Code” systems. We are currently adjusting our system
so that it can effectively work in these countries.
Moreover, we expect that our approach will be applicable
even in regions with potentially poor network
connectivity. This is because our relative-delay-based
method is insensitive to the inflated network latencies
characteristic of such environments.


²In the advanced traceroute case, 1 RTT is needed to obtain the IPs
of intermediate routers, while another RTT is needed to simultaneously
obtain round-trip time estimates to all intermediate routers by sending
concurrent probes.
6 Related work


6.1 Client-independent IP geolocation systems


6.1.1 Data mining-based 


DNS-based. Davis et al. [12] propose a DNS-based
approach, which suggests adding location segments in the
format of a Resource Record (RR). Nevertheless, such a
modification cannot be easily deployed in practice, and
administrators have little incentive to register new RRs
or modify existing ones. Moreover, Zhang et al. [25] have
demonstrated that DNS misnaming is common, and that it can
distort Internet topology mapping.

Whois-based. Moore et al. [18] argue that geolocation
can also be obtained by mining the Whois database.
However, as the authors themselves pointed out, large
entities with machines dispersed in different locations can
register their domain names with the geographical location
of their headquarters. As an example, many existing IP
geolocation databases that use this approach incorrectly
locate all of Google’s servers worldwide in Mountain
View, CA.

Hostname-based. Machine hostnames can sometimes
indicate geolocation information. In particular,
Padmanabhan and Subramanian’s GeoTrack [19] parses the
location of the last access router towards the target from
its hostname and uses the location of this router as that
of the target. Unfortunately, this method can be inhibited
by several factors, as pointed out in [14]. First, not all
machine names contain geolocation-related information.
Second, administrators can be very creative in naming
machines; hence, parsing all kinds of formats becomes
technically difficult. Finally, such last-hop location
substitution can incur errors.

Web-based. Guo et al.’s Structon [16] mines the
geolocation information from the Web. In particular, 
Structon builds a geolocation table and uses regular ex- 
pressions to extract location information from each web 
page of a very large-scale crawling dataset. Since Struc- 
ton does not combine delay measurement with the land- 
marks it discovers, it achieves a much coarser (city-level) 
geolocation granularity. For example, they extract all lo- 
cation keywords from a web page rather than just the lo- 
cation address. Likewise, they geolocate a domain name 
by choosing one from all locations provided by all the 
web pages within this domain name. Indeed, such ap- 
proaches are error prone. Moreover, geolocating a /24 
segment with a city blurs the finer-grained characteris- 
tics of each IP address in this segment. 

Other sources. Padmanabhan and Subramanian’s
GeoCluster [19] geolocates IP addresses into a geographical
cluster by using the address prefixes in BGP routing
tables. In addition, by acquiring the geolocation
information of some IP addresses in a cluster from
proprietary sources, e.g., users’ registration records in
the Hotmail service, GeoCluster deduces the location of
the entire cluster. This method depends heavily on the
correctness of users’ input and on private location
information, which is in general not publicly available.
Our approach differs from GeoCluster in that web designers
have a strong incentive to report correct location
information on their websites, while users are less likely
to provide accurate location information when registering
with online services, on which GeoCluster heavily relies.
Moreover, we have demonstrated that using active network
measurements, instead of extrapolating geo information to
entire clusters, is far more accurate.


6.1.2 Delay measurement-based 


GeoPing. Padmanabhan and Subramanian design GeoPing
[19], which assumes that two machines that have similar
delay vectors tend to be close to each other. The authors
rely on a set of active landmarks, i.e., those capable of
actively probing the target. Necessarily, the accuracy of
such an approach (comparative results are shown later in
the text) depends on the number of active landmarks, which
is typically moderate.
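
The delay-vector idea can be sketched in a few lines of Python. The host names and delay values below are hypothetical, and the Euclidean distance between delay vectors is an assumption of this sketch, not a detail taken from the description above.

```python
import math


def nearest_by_delay_vector(target_vector, reference_vectors):
    """Return the reference host whose delay vector is closest to the target's."""
    return min(
        reference_vectors,
        key=lambda name: math.dist(target_vector, reference_vectors[name]),
    )


# Delays (ms) measured from the same three probing hosts; hypothetical values.
references = {
    "host_chicago": [5.0, 22.0, 48.0],
    "host_dallas": [24.0, 6.0, 35.0],
    "host_seattle": [47.0, 36.0, 4.0],
}
target = [7.0, 20.0, 45.0]
print(nearest_by_delay_vector(target, references))
```

The target is then assigned the known location of the selected reference host, which is why the achievable accuracy depends on how many such hosts are available.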

CBG. Instead of yielding a discrete single geo point, 
Gueye et al. [15] introduce Constraint Based Geoloca- 
tion (CBG), a method that provides a continuous geo 
space by using multilateration with distance constraints. 
In particular, CBG first measures the delays from all van- 
tage points to the target. Then, it translates delays into 
distance by considering the best network condition of 
each vantage point, termed bestline. Finally, it returns 
a continuous geo space by applying multilateration. 
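
The three steps above can be sketched as follows. This Python example makes several simplifying assumptions of our own: vantage points lie on a flat plane with coordinates in km, the bestline is a straight line with a hypothetical slope and intercept shared by all vantage points, and the feasible region is approximated on a coarse grid.

```python
import math


def bestline_distance_km(rtt_ms, slope_ms_per_km, intercept_ms):
    """Upper bound on distance implied by a delay under a linear bestline."""
    return max(0.0, (rtt_ms - intercept_ms) / slope_ms_per_km)


def in_region(point, constraints):
    """True if the point lies inside every vantage point's distance circle."""
    return all(
        math.hypot(point[0] - vx, point[1] - vy) <= radius
        for (vx, vy), radius in constraints
    )


def region_centroid(constraints, step_km=10, extent_km=1000):
    """Approximate the centroid of the feasible region on a square grid."""
    n = extent_km // step_km
    pts = [
        (i * step_km, j * step_km)
        for i in range(-n, n + 1)
        for j in range(-n, n + 1)
        if in_region((i * step_km, j * step_km), constraints)
    ]
    if not pts:
        return None  # the circles do not intersect on this grid
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))


# Hypothetical vantage points (km) with measured RTTs (ms).
vantage_rtts = [((0, 0), 5.0), ((600, 0), 5.0), ((300, 500), 5.0)]
constraints = [
    (vp, bestline_distance_km(rtt, slope_ms_per_km=0.01, intercept_ms=1.0))
    for vp, rtt in vantage_rtts
]
print(region_centroid(constraints))
```

The intersection of the circles plays the role of the continuous geo space, and its centroid can serve as a single point estimate when one is needed.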

CBG uses bestline constraints to compensate for the
fact that Internet routes are sometimes indirect or
inflated. However, due to the difficulty of predicting the
directness of a network route from a vantage point to a
target, CBG only works well when the target is close to one
of the vantage points. As explained above, we use the
CBG approach directly in our tier 1 phase to discover the
coarse-grained area for a targeted IP. Moreover, using
newly discovered web landmarks in this area, we further
constrain the targeted area in the tier 2 phase. Thus,
while CBG is good at limiting the destination area, it is
inherently limited in its ability to achieve very
fine-grained resolution due to measurement inaccuracies.

TBG. Taking advantage of the fact that routers close to
the targets can be more accurately located, Katz-Bassett
et al. [17] propose Topology-based Geolocation (TBG),
which geolocates the target as well as the routers on the
path towards the target. The key contribution of this work
lies in showing that network topology can be effectively
used to achieve higher geolocation accuracy. In
particular, TBG uses the locations of intermediate routers
as landmarks to better quantify the directness of the
path to the target and geolocate it.


In addition to using network topological information,
a TBG variant also takes advantage of passive landmarks
with known locations. However, such an approach is
constrained by the fact that it only has a very limited
number of such landmarks. In contrast, our web-based
technique largely overcomes this difficulty by discovering
a large number of web-based landmarks. More substantially,
TBG fundamentally relies on absolute delay measurements,
which are necessarily inaccurate at short distances. In
contrast, in addition to relying on a large number of
web-based landmarks in an area, we demonstrate that our
relative distance approach, while technically less
attractive, is far more accurate.


Octant. Wong et al. [24] propose Octant, which con- 
siders the locations of intermediate routers as landmarks 
to geolocate the target. Further, Octant considers both 
positive information, the maximum distance that a tar- 
get may be from the landmark, and negative information, 
the minimum distance this target may be from the land- 
mark. In addition to delay-based constraints, Octant also 
enables any kind of positive and negative constraints to 
be deployed into its system, e.g., the negative constraints 
(oceans and uninhabitable areas) obtained from geogra- 
phy and demographics. 
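
The combination of positive and negative constraints can be illustrated with a small Python sketch. It assumes flat-plane coordinates in km, represents each landmark's constraint as a pair of minimum and maximum distances, and models an excluded area with a hypothetical predicate; none of these choices are taken from Octant itself.

```python
import math


def satisfies_constraints(point, constraints):
    """constraints: list of ((x, y), min_km, max_km), one entry per landmark."""
    return all(
        min_km <= math.hypot(point[0] - x, point[1] - y) <= max_km
        for (x, y), min_km, max_km in constraints
    )


def feasible_points(candidates, constraints, excluded=lambda p: False):
    """Keep candidates inside every annulus and outside excluded areas."""
    return [
        p for p in candidates
        if satisfies_constraints(p, constraints) and not excluded(p)
    ]


# Hypothetical landmarks with (minimum, maximum) distance bounds in km.
constraints = [((0.0, 0.0), 50.0, 200.0), ((300.0, 0.0), 100.0, 250.0)]
candidates = [(100.0, 0.0), (20.0, 0.0), (150.0, 50.0), (150.0, -50.0)]


def in_ocean(p):
    # Hypothetical geographic negative constraint: everything south of y = 0.
    return p[1] < 0


print(feasible_points(candidates, constraints, excluded=in_ocean))
```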


In an attempt to achieve high accuracy, Octant (as well
as the above TBG method) also adopts the locations of
routers on the path to the destination as landmarks to
geolocate the target. However, such an approach is
hampered in reaching finer-grained accuracy because it
fails to accurately geolocate routers at such precision in
the first place. Finally, while Octant pushes the accuracy
of delay-based approaches to an admirable limit, it is
incapable of achieving higher precision simply due to the
inherent inaccuracies associated with absolute delay
measurements.


Comparative results. According to [17], TBG has a
median estimation error of 67 km, which outperforms CBG,
with a median estimation error of 228 km, by a factor of
three. According to [24], Octant, with a median estimation
error of 22 miles, is three times better than GeoPing,
with an estimation error of 68 miles, and four times
better than CBG, with an error distance of 89 miles.
Because TBG and Octant used PlanetLab nodes to evaluate
their systems’ accuracy, we can directly compare them
with our system. As outlined above, our system’s median
error distance is 50 times smaller than Octant’s, and
approximately 100 times smaller than TBG’s.
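
These ratios follow directly from the reported medians, as the short Python calculation below shows. It uses the 0.69 km Planetlab median from Section 4.2.1 and the standard conversion of 1.609344 km per mile; the variable names are ours.

```python
KM_PER_MILE = 1.609344

ours_km = 0.69                                     # Planetlab median, Section 4.2.1
tbg_km, cbg_km = 67.0, 228.0                       # medians reported in [17]
octant_mi, geoping_mi, cbg_mi = 22.0, 68.0, 89.0   # medians reported in [24]

ratios = {
    "CBG / TBG": cbg_km / tbg_km,
    "GeoPing / Octant": geoping_mi / octant_mi,
    "CBG / Octant": cbg_mi / octant_mi,
    "Octant / ours": octant_mi * KM_PER_MILE / ours_km,
    "TBG / ours": tbg_km / ours_km,
}
for name, value in ratios.items():
    print(f"{name}: {value:.1f}")
```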


6.2 Client-dependent IP geolocation systems


6.2.1 Wireless geolocation 


GPS-based geolocation. Global Positioning System
(GPS) devices, which are nowadays embedded in billions of
mobile phones and computers, can precisely provide a
user’s location. However, GPS technology differs from our
geolocation strategy in the sense that it is a client-side
geolocation approach, which means that the server does not
know where the user is unless the user explicitly reports
this information back to the server.

Cell tower and Wi-Fi-based geolocation. Google
My Location [5] and Skyhook [9] introduced cell
tower-based and Wi-Fi-based geolocation approaches.
In particular, cell tower-based geolocation offers
users estimated locations by triangulating from the cell
towers surrounding them, while Wi-Fi-based geolocation
uses Wi-Fi access point information instead of cell
towers. Specifically, every tower or Wi-Fi access point
has a unique identification and footprint. To find a
user’s approximate location, such methods calculate the
user’s position relative to the unique identifications and
footprints of nearby cell towers or Wi-Fi access points.

Such methods can provide accurate results, e.g., an
accuracy of 200-1000 meters in the cell tower scenario and
10-20 meters in the Wi-Fi scenario [9], at the expense of
geolocation availability in three respects.

First, these approaches require the end user’s permission
to share their location. However, as we discussed above,
many applications, such as location-based access
restrictions, context-aware security, and online
advertising, cannot rely on the client’s support for
geolocation. Second, companies utilizing such an approach
must deploy drivers to survey every single street and
alley in tens of thousands of cities and towns worldwide,
scanning for cell towers and Wi-Fi access points, as well
as plotting their geographic locations. In our approach,
we avoid such heavy overhead by lightly crawling landmarks
from the Web. Third, these approaches are tailored towards
mobile phones and laptops. However, there are many devices
(IPs) on the Internet that are bound to wired networks.
Such wireless geolocation methods are necessarily
incapable of geolocating these IPs, while our method does
not impose any precondition on the end devices and IPs.


6.2.2 W3C geolocation 


A geolocation API specification [3] is going to become
a part of HTML 5 and appears to be a part of current
browsers already [6]. This API defines a high-level
interface to location information, and is agnostic of the
underlying location information sources. The underlying
location database could be collected and calculated
from GPS, Wi-Fi access points, cell towers, RFID,
Bluetooth MAC addresses, as well as IP addresses associated
with the devices. Again, this approach requires end users’
collaboration for geolocation. In addition, this method
also requires browser compatibility, e.g., the Web browser
must support HTML 5. Finally, to geolocate wired devices,
W3C geolocation has to resort to the IP address-based
approaches discussed in Section 6.1.1 and Section 6.1.2.
In this case, our method can be considered an effective
alternative to improve the accuracy.


7 Conclusions 


We have developed a client-independent geolocation
system able to geolocate IP addresses with more than an
order of magnitude better precision than the best previous
method. Our methodology consists of two powerful
components. First, we utilized a system that effectively
harvests geolocation information available on the Web to
build a database of landmarks in a given ZIP Code. Second,
we employed a three-tiered system that begins at a large,
coarse-grained scale and progressively works its way to a
finer, street-level scale. At each stage, it takes
advantage of landmark data and the fact that, on the
smaller scale, relative distances are preserved by delay
measurements, overcoming many of the fundamental
inaccuracies encountered in the use of absolute
measurements. By combining these, we demonstrated the
effectiveness of using both active delay measurements and
web mining for geolocation purposes.

We have shown that our algorithm functions well
in the wild, and is able to locate IP addresses in the
real world with extreme accuracy. Additionally, we
demonstrated that our algorithm is widely applicable
to IP addresses from academic institutions, a
collection of residential addresses, and a larger
mixed collection of addresses. The high accuracy of
our system in a wide range of networking environments
demonstrates its potential to dramatically improve the
performance of existing location-dependent Internet
applications and to open the doors to novel ones.